Abstract

When information is to be transmitted over an unknown, possibly unreliable channel, an 
erasure option at the decoder is desirable. Using constant-composition random codes, we propose 
a generalization of Csiszar and Korner's Maximum Mutual Information decoder with erasure 
option for discrete memoryless channels. The new decoder is parameterized by a weighting 
function that is designed to optimize the fundamental tradeoff between undetected-error and 
erasure exponents for a compound class of channels. The class of weighting functions may be 
further enlarged to optimize a similar tradeoff for list decoders — in that case, undetected-error 
probability is replaced with average number of incorrect messages in the list. Explicit solutions 
are identified. 

The optimal exponents admit simple expressions in terms of the sphere-packing exponent, 
at all rates below capacity. For small erasure exponents, these expressions coincide with those 
derived by Forney (1968) for symmetric channels, using Maximum a Posteriori decoding. Thus 
for those channels at least, ignorance of the channel law is inconsequential. Conditions for opti- 
mality of the Csiszar-Korner rule and of the simpler empirical-mutual-information thresholding 
rule are identified. The error exponents are evaluated numerically for the binary symmetric
channel.

Keywords: error exponents, constant-composition codes, random codes, method of types, 
sphere packing, maximum mutual information decoder, universal decoding, erasures, list decod- 
ing, Neyman-Pearson hypothesis testing. 
1 Introduction



Universal decoders have been studied extensively in the information theory literature as they are 
applicable to a variety of communication problems where the channel is partly or even completely 
unknown [1,2]. In particular, the Maximum Mutual Information (MMI) decoder provides univer- 
sally attainable error exponents for random constant-composition codes over discrete memoryless 
channels (DMCs). In some cases, incomplete knowledge of the channel law is inconsequential as 
the resulting error exponents are the same as those for maximum-likelihood decoders which know 
the channel law in effect. 

It is often desirable to provide the receiver with an erasure option that can be exercised when
the received data are deemed unreliable. For fixed channels, Forney [3] derived the decision rule that 
provides the optimal tradeoff between the erasure and undetected-error probabilities, analogously 
to the Neyman-Pearson problem for binary hypothesis testing. Forney used the same framework 
to optimize the performance of list decoders; the probability of undetected errors is then replaced 
by the expected number of incorrect messages on the list. The size of the list is a random variable 
which equals 1 with high probability when communication is reliable. 

For unknown channels, the problem of decoding with erasures was considered by Csiszar and 
Korner [1]. They derived attainable pairs of undetected-error and erasure exponents for any DMC. 
Their work was later extended by Telatar and Gallager [4]. However, neither [1] nor [4] indicated
whether true universality is achievable, i.e., whether the exponents match Forney's exponents. Also 
they did not indicate whether their error exponents might be optimal in some weaker sense. The 
problem was recently revisited by Merhav and Feder [5], using a competitive minimax approach. 
The analysis of [5] yields lower bounds on a certain fraction of the optimal exponents. It is suggested 
in [5] that true universality might generally not be attainable, which would represent a fundamental 
difference with ordinary decoding. 

This paper considers decoding with erasures for the compound DMC, with two goals in mind. 
The first is to construct a broad class of decision rules that can be optimized in an asymptotic 
Neyman-Pearson sense, analogously to universal hypothesis testing [6-8]. The second is to investi- 
gate the universality properties of the receiver, in particular conditions under which the exponents 
coincide with Forney's exponents. We first solve the problem of variable-size list decoders because 
it is simpler, and the solution to the ordinary problem of size-1 lists follows directly. We establish 
conditions under which our error exponents match Forney's exponents. 

Following background material in Sec. 2, the main results are given in Secs. 3-5. We also
observe that in some problems the compound DMC approach is overly rigid and pessimistic. For
such problems we present in Sec. 6 a simple and flexible extension of our method based on the
relative minimax principle. In Sec. 7 we apply our results to a class of Binary Symmetric Channels
(BSC), which yields easily computable and insightful formulas. The paper concludes with a brief
discussion in Sec. 8. The proofs of the main results are given in the appendices.

1.1 Notation 

We use uppercase letters for random variables, lowercase letters for individual values, and boldface 
fonts for sequences. The probability mass function (p.m.f.) of a random variable $X \in \mathcal{X}$ is denoted
by $p_X = \{p_X(x), x \in \mathcal{X}\}$, the probability of a set $\Omega$ under $p_X$ by $P_X(\Omega)$, and the expectation
operator by $\mathbb{E}$. Entropy of a random variable $X$ is denoted by $H(X)$, and mutual information
between two random variables $X$ and $Y$ is denoted by $I(X;Y) = H(X) - H(X|Y)$, or by $I(p_{XY})$
when the dependency on $p_{XY}$ should be explicit. The Kullback-Leibler divergence between two
p.m.f.'s $p$ and $q$ is denoted by $D(p\|q)$. All logarithms are in base 2. We denote by $f'$ the derivative
of a function $f$.

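Denote by $p_{\mathbf{x}}$ the type of a sequence $\mathbf{x} \in \mathcal{X}^N$ ($p_{\mathbf{x}}$ is an empirical p.m.f. over $\mathcal{X}$) and by $T_{\mathbf{x}}$ the
type class associated with $p_{\mathbf{x}}$, i.e., the set of all sequences of type $p_{\mathbf{x}}$. Likewise, denote by $p_{\mathbf{xy}}$ the
joint type of a pair of sequences $(\mathbf{x}, \mathbf{y}) \in \mathcal{X}^N \times \mathcal{Y}^N$ (a p.m.f. over $\mathcal{X} \times \mathcal{Y}$) and by $T_{\mathbf{xy}}$ the type class
associated with $p_{\mathbf{xy}}$, i.e., the set of all sequences of type $p_{\mathbf{xy}}$. The conditional type $p_{\mathbf{y}|\mathbf{x}}$ of a pair of
sequences $(\mathbf{x}, \mathbf{y})$ is defined as $p_{\mathbf{xy}}(x, y)/p_{\mathbf{x}}(x)$ for all $x \in \mathcal{X}$ such that $p_{\mathbf{x}}(x) > 0$. The conditional
type class $T_{\mathbf{y}|\mathbf{x}}$ is the set of all sequences $\tilde{\mathbf{y}}$ such that $(\mathbf{x}, \tilde{\mathbf{y}}) \in T_{\mathbf{xy}}$. We denote by $H(\mathbf{x})$ the entropy
of the p.m.f. $p_{\mathbf{x}}$ and by $H(\mathbf{y}|\mathbf{x})$ and $I(\mathbf{x}; \mathbf{y})$ the conditional entropy and the mutual information
for the joint p.m.f. $p_{\mathbf{xy}}$, respectively. Recall that [1]

$$ (N+1)^{-|\mathcal{X}|} \, 2^{N H(\mathbf{x})} \le |T_{\mathbf{x}}| \le 2^{N H(\mathbf{x})}, \tag{1.1} $$

$$ (N+1)^{-|\mathcal{X}||\mathcal{Y}|} \, 2^{N H(\mathbf{y}|\mathbf{x})} \le |T_{\mathbf{y}|\mathbf{x}}| \le 2^{N H(\mathbf{y}|\mathbf{x})}. \tag{1.2} $$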
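The bound (1.1) can be checked numerically on a short sequence. The following is a minimal sketch, not part of the original development: the helper names (`type_class_size`, `empirical_entropy`, `type_class_bounds`) and the ternary example sequence are hypothetical choices made here for illustration. The size of the type class is computed exactly as a multinomial coefficient and compared with the two exponential bounds.

```python
from collections import Counter
from math import comb, log2

def type_class_size(counts):
    """Number of sequences whose symbol counts equal `counts` (a multinomial coefficient)."""
    size, remaining = 1, sum(counts)
    for c in counts:
        size *= comb(remaining, c)
        remaining -= c
    return size

def empirical_entropy(counts):
    """Entropy in bits of the type with the given symbol counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def type_class_bounds(x, alphabet_size):
    """Return (lower bound, |T_x|, upper bound) as they appear in (1.1)."""
    n = len(x)
    counts = list(Counter(x).values())
    upper = 2 ** (n * empirical_entropy(counts))
    lower = upper / (n + 1) ** alphabet_size
    return lower, type_class_size(counts), upper

# Hypothetical example: N = 10 over a ternary alphabet, counts (5, 3, 2).
lower, size, upper = type_class_bounds("aabacbbaca", alphabet_size=3)
```

The polynomial factor $(N+1)^{-|\mathcal{X}|}$ is what makes the two bounds agree on the exponential scale.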

We let $\mathscr{P}_X$ and $\mathscr{P}_X^{[N]}$ represent the set of all p.m.f.'s and empirical p.m.f.'s, respectively, for
a random variable $X$. Likewise, $\mathscr{P}_{Y|X}$ and $\mathscr{P}_{Y|X}^{[N]}$ denote the set of all conditional p.m.f.'s and
all empirical conditional p.m.f.'s, respectively, for a random variable $Y$ given $X$. The notation
$f(N) \sim g(N)$ denotes asymptotic equality: $\lim_{N \to \infty} \frac{f(N)}{g(N)} = 1$. The shorthands $f(N) \doteq g(N)$ and
$f(N) \,\dot{\le}\, g(N)$ denote equality and inequality on the exponential scale: $\lim_{N \to \infty} \frac{1}{N} \ln \frac{f(N)}{g(N)} = 0$ and
$\lim_{N \to \infty} \frac{1}{N} \ln \frac{f(N)}{g(N)} \le 0$, respectively. We denote by $\mathbb{1}_{\{x \in \Omega\}}$ the indicator function of a set $\Omega$ and
define $|t|^+ = \max(0, t)$ and $\exp_2(t) = 2^t$. We adopt the notational convention that the minimum of
a function over an empty set is $+\infty$.

The function-ordering notation $F \preceq G$ indicates that $F(t) \le G(t)$ for all $t$. Similarly, $F \succeq G$
indicates that $F(t) \ge G(t)$ for all $t$.



2 Decoding with Erasure and List Options 

2.1 Maximum-Likelihood Decoding 

In his 1968 paper [3], Forney studied the following erasure/list decoding problem. A length-$N$,
rate-$R$ code $\mathcal{C} = \{\mathbf{x}(m), m \in \mathcal{M}\}$ is selected, where $\mathcal{M} = \{1, 2, \cdots, 2^{NR}\}$ is the message set and
each codeword $\mathbf{x}(m) \in \mathcal{X}^N$. Upon selection of a message $m$, the corresponding $\mathbf{x}(m)$ is transmitted
over a DMC $p_{Y|X} : \mathcal{X} \to \mathcal{Y}$. A set of decoding regions $\mathcal{D}_m \subseteq \mathcal{Y}^N$, $m \in \mathcal{M}$, is defined, and the
decoder returns $\hat{m} = g(\mathbf{y})$ if and only if $\mathbf{y} \in \mathcal{D}_{\hat{m}}$. For ordinary decoding, $\{\mathcal{D}_m, m \in \mathcal{M}\}$ form a
partition of $\mathcal{Y}^N$. When an erasure option is introduced, the decision space is extended to $\mathcal{M} \cup \{\emptyset\}$,
where $\emptyset$ denotes the erasure symbol. The erasure region $\mathcal{D}_\emptyset$ is the complement of $\bigcup_{m \in \mathcal{M}} \mathcal{D}_m$ in $\mathcal{Y}^N$.
An undetected error arises if $m$ was transmitted but $\mathbf{y}$ lies in the decoding region of some other
message $i \ne m$. This event is given by

$$ \mathcal{E}_i = \Bigl\{ (m, \mathbf{y}) : \mathbf{y} \in \bigcup_{i \in \mathcal{M} \setminus \{m\}} \mathcal{D}_i \Bigr\}, \tag{2.1} $$



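where the subscript $i$ stands for "incorrect message". Hence

$$ \Pr[\mathcal{E}_i] = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \sum_{\mathbf{y} \in \bigcup_{i \ne m} \mathcal{D}_i} p^N_{Y|X}(\mathbf{y}|\mathbf{x}(m)) = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \sum_{i \in \mathcal{M} \setminus \{m\}} \sum_{\mathbf{y} \in \mathcal{D}_i} p^N_{Y|X}(\mathbf{y}|\mathbf{x}(m)), \tag{2.2} $$

where the second equality holds because the decoding regions are disjoint.
The erasure event is given by

$$ \mathcal{E}_\emptyset = \{ (m, \mathbf{y}) : \mathbf{y} \in \mathcal{D}_\emptyset \} $$

and has probability

$$ \Pr[\mathcal{E}_\emptyset] = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \sum_{\mathbf{y} \in \mathcal{D}_\emptyset} p^N_{Y|X}(\mathbf{y}|\mathbf{x}(m)). \tag{2.3} $$

The total error event is given by $\mathcal{E}_{err} = \mathcal{E}_i \cup \mathcal{E}_\emptyset$. The decoder is generally designed so that $\Pr[\mathcal{E}_i] \ll \Pr[\mathcal{E}_\emptyset]$,
so $\Pr[\mathcal{E}_{err}] \approx \Pr[\mathcal{E}_\emptyset]$.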

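Analogously to the Neyman-Pearson problem, one wishes to design the decoding regions to
obtain an optimal tradeoff between $\Pr[\mathcal{E}_i]$ and $\Pr[\mathcal{E}_\emptyset]$. Forney proved that the following class of decision
rules is optimal:

$$ g_{ML}(\mathbf{y}) = \begin{cases} \hat{m} & \text{if } p^N_{Y|X}(\mathbf{y}|\mathbf{x}(\hat{m})) > e^{NT} \sum_{i \ne \hat{m}} p^N_{Y|X}(\mathbf{y}|\mathbf{x}(i)) \\ \emptyset & \text{else} \end{cases} \tag{2.4} $$

where $T \ge 0$ is a free parameter trading off $\Pr[\mathcal{E}_i]$ against $\Pr[\mathcal{E}_\emptyset]$. The nonnegativity constraint on
$T$ ensures that $\hat{m}$ is uniquely defined for any given $\mathbf{y}$. There is no other decision rule that yields
simultaneously a lower value for $\Pr[\mathcal{E}_i]$ and for $\Pr[\mathcal{E}_\emptyset]$.

A conceptually simple (but suboptimal) alternative to (2.4) is

$$ g_{ML,2}(\mathbf{y}) = \begin{cases} \hat{m} & \text{if } p^N_{Y|X}(\mathbf{y}|\mathbf{x}(\hat{m})) > e^{NT} \max_{i \ne \hat{m}} p^N_{Y|X}(\mathbf{y}|\mathbf{x}(i)) \\ \emptyset & \text{else} \end{cases} \tag{2.5} $$

where the decision is made based on the two highest likelihood scores.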

If one chooses $T < 0$, there is generally more than one value of $\hat{m}$ that satisfies (2.4), and $g_{ML}$
may be viewed as a list decoder that returns the list of all such $\hat{m}$. Denote by $N_i$ the number
of incorrect messages on the list. Since the decoding regions $\{\mathcal{D}_m, m \in \mathcal{M}\}$ overlap, the average
number of incorrect messages in the list,

$$ \mathbb{E}[N_i] = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \sum_{i \in \mathcal{M} \setminus \{m\}} \sum_{\mathbf{y} \in \mathcal{D}_i} p^N_{Y|X}(\mathbf{y}|\mathbf{x}(m)), \tag{2.6} $$

no longer coincides with $\Pr[\mathcal{E}_i]$ in (2.2).
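The role of the threshold $T$ in (2.4) is easy to see on a toy example. The sketch below is an illustration under stated assumptions, not Forney's construction: the three-codeword length-5 codebook, the BSC with crossover probability 0.1, and the function names `bsc_likelihood` and `forney_decode` are all hypothetical. The decoder returns the list of messages that pass the test in (2.4); an empty list is an erasure.

```python
from math import exp

def bsc_likelihood(y, x, p):
    """p^N(y|x) for a BSC with crossover probability p."""
    d = sum(a != b for a, b in zip(x, y))          # Hamming distance
    return p ** d * (1 - p) ** (len(x) - d)

def forney_decode(y, codebook, p, T):
    """Messages m whose likelihood exceeds e^{NT} times the sum of all other likelihoods.

    T >= 0: at most one message passes (empty list = erasure).
    T <  0: several messages may pass (list decoding).
    """
    n = len(y)
    lik = [bsc_likelihood(y, x, p) for x in codebook]
    total = sum(lik)
    return [m for m, l in enumerate(lik) if l > exp(n * T) * (total - l)]

# Hypothetical toy codebook (not a good code, only an illustration).
codebook = ["00000", "11111", "00111"]
clean = forney_decode("00000", codebook, p=0.1, T=0.5)      # confident decision
erased = forney_decode("00011", codebook, p=0.1, T=0.5)     # ambiguous, so erased
plain = forney_decode("00011", codebook, p=0.1, T=0.0)      # T = 0 accepts the best message
listed = forney_decode("00011", codebook, p=0.1, T=-1.0)    # T < 0 returns a list
```

Raising $T$ converts borderline decisions into erasures, while a negative $T$ enlarges the list, which is the tradeoff quantified by $\Pr[\mathcal{E}_\emptyset]$ and $\mathbb{E}[N_i]$.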

For the rule (2.4) applied to symmetric channels, Forney showed that the following error exponents
are achievable for all $\Delta$ such that $R^{conj} \le R + \Delta \le C$:

$$ E_i(R, \Delta) = E_{sp}(R) + \Delta, \qquad E_\emptyset(R, \Delta) = E_{sp}(R + \Delta), \tag{2.7} $$

where $E_{sp}(R)$ is the sphere-packing exponent, and $R^{conj}$ is the conjugate rate, defined as the rate
for which the slope of $E_{sp}(\cdot)$ is the inverse of the slope at rate $R$:

$$ E'_{sp}(R^{conj}) = \frac{1}{E'_{sp}(R)}. $$

The exponents of (2.7) are achieved using independent and identically distributed (i.i.d.) codes.
2.2 Universal Decoding 

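When the channel law $p_{Y|X}$ is unknown, maximum-likelihood decoding cannot be used. For
constant-composition codes with type $p_X$, the MMI decoder takes the form

$$ g_{MMI}(\mathbf{y}) = \arg\max_{i \in \mathcal{M}} I(\mathbf{x}(i); \mathbf{y}). \tag{2.8} $$

Csiszar and Korner [1, p. 174-178] extended the MMI decoder to include an erasure option, using
the following decision rule:

$$ g_{\lambda,\Delta}(\mathbf{y}) = \begin{cases} \hat{m} & \text{if } I(\mathbf{x}(\hat{m}); \mathbf{y}) > R + \Delta + \lambda \max_{i \ne \hat{m}} |I(\mathbf{x}(i); \mathbf{y}) - R|^+ \\ \emptyset & \text{else} \end{cases} \tag{2.9} $$

where $\Delta \ge 0$ and $\lambda \ge 1$. They derived the following error exponents for the resulting undetected-error
and erasure events:

$$ \{ E_{r,\lambda}(R, p_X, p_{Y|X}) + \Delta, \; E_{r,1/\lambda}(R + \Delta, p_X, p_{Y|X}) \}, \quad \forall p_{Y|X}, $$

where

$$ E_{r,\lambda}(R, p_X, p_{Y|X}) = \min_{\tilde{p}_{Y|X}} \{ D(\tilde{p}_{Y|X} \| p_{Y|X} | p_X) + \lambda |I(p_X, \tilde{p}_{Y|X}) - R|^+ \}. $$

While $\Delta$ and $\lambda$ are tradeoff parameters, they did not mention whether the decision rule (2.9)
satisfies any Neyman-Pearson type optimality criterion.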

A different approach was recently proposed by Merhav and Feder [5]. They raised the possibility 
that the achievable pairs of undetected-error and erasure exponents might be smaller than in the 
known-channel case and proposed a decision rule based on the competitive minimax principle. 
This rule is parameterized by a scalar parameter $0 \le \xi \le 1$ which represents a fraction of the
optimal exponents (for the known-channel case) that their decoding procedure is guaranteed to
achieve. Decoding involves explicit maximization of a cost function over the compound DMC
family, analogously to a Generalized Likelihood Ratio Test (GLRT). The rule coincides with the
GLRT when $\xi = 0$, but the choice of $\xi$ can be optimized. They conjectured that the highest
achievable $\xi$ is lower than 1 in general, and derived a computable lower bound on that value.
3 $\mathcal{F}$-MMI Class of Decoders



3.1 Decoding Rule 

Assume that random constant-composition codes with type $p_X$ are used, and that the DMC $p_{Y|X}$
belongs to a connected subset $\mathscr{W}$ of $\mathscr{P}_{Y|X}$. The decoder knows $\mathscr{W}$ but not which $p_{Y|X}$ is in effect.

Analogously to (2.9), our proposed decoding rule is a test based on the empirical mutual
informations for the two highest-scoring messages. Let $\mathcal{F}$ be the class of continuous, nondecreasing
functions $F : [-R, H(p_X) - R] \to \mathbb{R}$. The decision rule indexed by $F \in \mathcal{F}$ takes the form

$$ g_F(\mathbf{y}) = \begin{cases} \hat{m} & \text{if } I(\mathbf{x}(\hat{m}); \mathbf{y}) > R + \max_{i \ne \hat{m}} F(I(\mathbf{x}(i); \mathbf{y}) - R) \\ \emptyset & \text{else.} \end{cases} \tag{3.1} $$

Given a candidate message $\hat{m}$, the function $F$ weighs the score of the best competing codeword.
Since $0 \le I(\mathbf{x}(i); \mathbf{y}) \le H(p_X)$, all values of $F(t)$ outside the range $[-R, H(p_X) - R]$ are equivalent
in terms of the decision rule (3.1).



The choice $F(t) = t$ results in the MMI decoding rule (2.8), and

$$ F(t) = \Delta + \lambda |t|^+ \tag{3.2} $$

(a two-parameter family of functions) results in the Csiszar-Korner rule (2.9).

One may further require that $F(t) \ge t$ to guarantee that $\hat{m} = \arg\max_i I(\mathbf{x}(i); \mathbf{y})$, as can be
verified by direct substitution into (3.1). In this case, the decision is whether the decoder should
output the highest-scoring message or output an erasure decision.

When the restriction $F(t) \ge t$ is not imposed, the decision rule (3.1) is ambiguous because more
than one $\hat{m}$ could satisfy the inequality in (3.1). Then (3.1) may be viewed as a list decoder that
returns the list of all such $\hat{m}$, similarly to (2.4).
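The rule (3.1) needs only joint types, so it can be sketched in a few lines. The code below is a minimal illustration, assuming a toy constant-composition codebook of three length-8 binary codewords and a noise-free received sequence; the names `empirical_mi` and `f_mmi_decode` are hypothetical. The weighting function $F$ is passed as a callable, so the MMI rule, the Csiszar-Korner rule, and a list decoder are obtained by changing one argument.

```python
from collections import Counter
from math import log2

def empirical_mi(x, y):
    """Empirical mutual information I(x;y) in bits, computed from the joint type of (x, y)."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())

def f_mmi_decode(y, codebook, rate, F):
    """Messages passing the test (3.1); an empty list is an erasure."""
    scores = [empirical_mi(x, y) for x in codebook]
    out = []
    for m, s in enumerate(scores):
        # F weighs the scores of all competing codewords; only the largest value matters.
        best_rival = max(F(t - rate) for i, t in enumerate(scores) if i != m)
        if s > rate + best_rival:
            out.append(m)
    return out

# Hypothetical toy codebook: every codeword has four 0s and four 1s.
codebook = ["00001111", "00110011", "01010101"]
rate = log2(len(codebook)) / 8
y = "00001111"                                                      # codeword 0, no noise

mmi = f_mmi_decode(y, codebook, rate, lambda t: t)                  # F(t) = t
ck = f_mmi_decode(y, codebook, rate, lambda t: 0.5 + max(t, 0.0))   # Delta = 0.5, lambda = 1
strict = f_mmi_decode(y, codebook, rate, lambda t: 0.9 + max(t, 0.0))
as_list = f_mmi_decode(y, codebook, rate, lambda t: -0.3)           # F below t: list decoder
```

With $\Delta = 0.9$ the threshold $R + \Delta$ exceeds the largest possible score $H(p_X) = 1$ bit, so even a noise-free observation is erased; with a negative constant $F$ every codeword is listed.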

The Csiszar-Korner decision rule parameterized by $F$ in (3.2) is nonambiguous for $\lambda \ge 1$. Note
there is an error in Theorem 5.11 and Corollary 5.11A of [1, p. 175], where the condition $\lambda > 0$
should be replaced with $\lambda \ge 1$ [9].

In the limit as $\lambda \downarrow 0$, (3.2) leads to the simple decoder that lists all messages whose empirical
mutual information score exceeds $R + \Delta$. If a list decoder is not desired, a simple variation on (3.2)
when $0 < \lambda < 1$ is

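It is also worth noting that the function $F(t) = \Delta + t$ may be thought of as an empirical version
of Forney's suboptimal decoding rule (2.5), with $T = \Delta$. Indeed, using the identity $I(\mathbf{x}; \mathbf{y}) =
H(\mathbf{y}) - H(\mathbf{y}|\mathbf{x})$ and viewing the negative empirical equivocation

$$ -H(\mathbf{y}|\mathbf{x}) = \sum_{x,y} p_{\mathbf{xy}}(x, y) \ln p_{\mathbf{y}|\mathbf{x}}(y|x) $$

as an empirical version of the normalized loglikelihood

$$ \frac{1}{N} \ln p^N_{Y|X}(\mathbf{y}|\mathbf{x}) = \sum_{x,y} p_{\mathbf{xy}}(x, y) \ln p_{Y|X}(y|x), $$

we may rewrite (2.5) and (3.1) respectively as

$$ g_{ML,2}(\mathbf{y}) = \begin{cases} \hat{m} & \text{if } \frac{1}{N} \ln p^N_{Y|X}(\mathbf{y}|\mathbf{x}(\hat{m})) > T + \max_{i \ne \hat{m}} \frac{1}{N} \ln p^N_{Y|X}(\mathbf{y}|\mathbf{x}(i)) \\ \emptyset & \text{else} \end{cases} \tag{3.3} $$

and

$$ g_F(\mathbf{y}) = \begin{cases} \hat{m} & \text{if } -H(\mathbf{y}|\mathbf{x}(\hat{m})) > \Delta + \max_{i \ne \hat{m}} [-H(\mathbf{y}|\mathbf{x}(i))] \\ \emptyset & \text{else.} \end{cases} \tag{3.4} $$

While this observation does not imply that $F(t) = \Delta + t$ is an optimal choice for $F$, one might intuitively
expect optimality in some regime.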



3.2 Error Exponents 

For a random-coding strategy using constant-composition codes with type $p_X$, the expected number
of incorrect messages on the list, $\mathbb{E}[N_i]$, and the erasure probability, $\Pr[\mathcal{E}_\emptyset]$, may be viewed as
functions of $R$, $p_X$, $p_{Y|X}$, and $F$. A pair $\{E_i(R, p_X, p_{Y|X}, F), E_\emptyset(R, p_X, p_{Y|X}, F)\}$ of incorrect-message
and erasure exponents is said to be universally attainable for such codes over $\mathscr{W}$ if the
expected number of incorrect messages on the list and the erasure probability satisfy

$$ \mathbb{E}[N_i] \le \exp_2\{-N[E_i(R, p_X, p_{Y|X}, F) - \epsilon]\}, \tag{3.5} $$

$$ \Pr[\mathcal{E}_\emptyset] \le \exp_2\{-N[E_\emptyset(R, p_X, p_{Y|X}, F) - \epsilon]\}, \quad \forall p_{Y|X} \in \mathscr{W}, \tag{3.6} $$

for any $\epsilon > 0$ and $N$ greater than some $N_0(\epsilon)$. The worst-case exponents (over all $p_{Y|X} \in \mathscr{W}$) are
denoted by

$$ E_i(R, p_X, \mathscr{W}, F) \triangleq \min_{p_{Y|X} \in \mathscr{W}} E_i(R, p_X, p_{Y|X}, F), \tag{3.7} $$

$$ E_\emptyset(R, p_X, \mathscr{W}, F) \triangleq \min_{p_{Y|X} \in \mathscr{W}} E_\emptyset(R, p_X, p_{Y|X}, F). \tag{3.8} $$

Our problem is to maximize the erasure exponent $E_\emptyset(R, p_X, \mathscr{W}, F)$ subject to the constraint
that the incorrect-message exponent $E_i(R, p_X, \mathscr{W}, F)$ is at least equal to some prescribed value $\alpha$.
This is an asymptotic Neyman-Pearson problem. We shall focus on the regime of practical interest
where erasures are more acceptable than undetected errors:

$$ E_\emptyset(R, p_X, \mathscr{W}, F) \le E_i(R, p_X, \mathscr{W}, F). $$

We emphasize that asymptotic Neyman-Pearson optimality of the decision rule holds only in a
restricted sense, namely, with respect to the $\mathcal{F}$-MMI class (3.1).

Specifically, given $R$ and $\mathscr{W}$, we seek the solution to the constrained optimization problem

$$ E^*_\emptyset(R, \mathscr{W}, \alpha) = \max_{p_X} \max_{F \in \mathcal{F}(R, p_X, \mathscr{W}, \alpha)} \min_{p_{Y|X} \in \mathscr{W}} E_\emptyset(R, p_X, p_{Y|X}, F) \tag{3.9} $$

where $\mathcal{F}(R, p_X, \mathscr{W}, \alpha)$ is the set of functions $F$ that satisfy

$$ \min_{p_{Y|X} \in \mathscr{W}} E_i(R, p_X, p_{Y|X}, F) \ge \alpha \tag{3.10} $$

as well as the continuity and monotonicity conditions mentioned above (3.1).

If we were able to choose $F$ as a function of $p_{Y|X}$, we would do at least as well as in (3.9) and
achieve the erasure exponent

$$ E^{**}_\emptyset(R, \mathscr{W}, \alpha) = \max_{p_X} \min_{p_{Y|X} \in \mathscr{W}} \max_{F \in \mathcal{F}(R, p_X, p_{Y|X}, \alpha)} E_\emptyset(R, p_X, p_{Y|X}, F) \ge E^*_\emptyset(R, \mathscr{W}, \alpha). \tag{3.11} $$

We shall be particularly interested in characterizing $(R, \mathscr{W}, \alpha)$ for which the decoder incurs no
penalty for not knowing $p_{Y|X}$, i.e.,

• equality holds in (3.11), and

• the optimal exponents $E_i(R, p_X, p_{Y|X}, F)$ and $E_\emptyset(R, p_X, p_{Y|X}, F)$ in (3.10) and (3.9) coincide
with Forney's exponents in (2.7) for all $p_{Y|X} \in \mathscr{W}$,

the second property being stronger than the first one.
3.3 Basic Properties of F 

To simplify the derivations, it is convenient to slightly strengthen the requirement that $F$ be
nondecreasing, and work with strictly increasing functions $F$ instead. Then the maxima over $F$ in
(3.9) and (3.11) are replaced with suprema, but of course their value remains the same.
To each monotonically increasing function $F$ corresponds an inverse $F^{-1}$, such that

$$ F(t) = u \iff F^{-1}(u) = t. $$

Elementary properties satisfied by the inverse function include:

(P1) $F^{-1}$ is continuous and increasing over its range.

(P2) If $F \preceq G$, then $F^{-1} \succeq G^{-1}$.

(P3) $G(t) = F(t) + \Delta \iff G^{-1}(t) = F^{-1}(t - \Delta)$.

(P4) $\frac{dF^{-1}(t)}{dt} = 1 / F'(F^{-1}(t))$.

(P5) If $F$ is convex, then $F^{-1}$ is concave.

(P6) The domain of $F^{-1}$ is the range of $F$, and vice-versa.

Now for any $F$ such that $F(H(p_X) - R) \ge 0$, define the scalar

$$ t_F \triangleq \sup \{ t : F(t) = |F(t')|^+, \; \forall t' \le t \le H(p_X) - R \} \tag{3.12} $$

which may depend on $R$ and $p_X$ via the difference $H(p_X) - R$. From this definition we have the
following properties:

• $|F(t)|^+$ is constant for all $t \le t_F$;

• $F(t) \ge 0$ for $t \ge t_F$.

We have $t_F = 0$ if $F(t)$ is chosen as in (4.5), or if $F(t) = \Delta + \lambda |t|^+$. If $F$ has a zero-crossing, $t_F$
is that zero-crossing. For instance, $t_F = -\Delta/\lambda$ if $F(t) = \Delta + \lambda t$. Or $t_F = \min\{\Delta, H(p_X) - R\}$ if
$F(t) = \lambda |t - \Delta|^+$. The supremum in (3.12) always exists.
4 Random-Coding and Sphere-Packing Exponents



The sphere packing exponent for channel $p_{Y|X}$ is defined as

$$ E_{sp}(R, p_X, p_{Y|X}) \triangleq \min_{\tilde{p}_{Y|X} \,:\, I(p_X, \tilde{p}_{Y|X}) \le R} D(\tilde{p}_{Y|X} \| p_{Y|X} | p_X) \tag{4.1} $$

and as $\infty$ if the minimization above is over an empty set. The function $E_{sp}(R, p_X, p_{Y|X})$ is convex,
nonincreasing, and differentiable in $R$, and continuous in $p_{Y|X}$.

The sphere packing exponent for class $\mathscr{W}$ is defined as

$$ E_{sp}(R, p_X, \mathscr{W}) \triangleq \min_{p_{Y|X} \in \mathscr{W}} E_{sp}(R, p_X, p_{Y|X}). \tag{4.2} $$

The function $E_{sp}(R, p_X, \mathscr{W})$ is differentiable in $R$ because $\mathscr{W}$ is a connected set. In some cases,
$E_{sp}(R, p_X, \mathscr{W})$ is convex in $R$, e.g., when the same $p_{Y|X}$ achieves the minimum in (4.2) at all
rates. Denote by $R_\infty(p_X, \mathscr{W})$ the infimum of the rates $R$ such that $E_{sp}(R, p_X, \mathscr{W}) < \infty$, and by
$I(p_X, \mathscr{W}) = \min_{p_{Y|X} \in \mathscr{W}} I(p_X, p_{Y|X})$ the supremum of $R$ such that $E_{sp}(R, p_X, \mathscr{W}) > 0$.
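For a BSC with uniform input type, the minimization in (4.1) has a well-known closed form: the minimizing channel is itself a BSC with crossover probability $q$ solving $1 - h(q) = R$, and the exponent is the binary divergence $D(q\|p)$. The sketch below uses that closed form, which is an assumption brought in here for illustration rather than a formula stated in this section; the function names and the grid of crossover probabilities standing in for the class $\mathscr{W}$ are hypothetical.

```python
from math import log2

def h2(q):
    """Binary entropy in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

def kl2(q, p):
    """Binary Kullback-Leibler divergence D(q||p) in bits, for 0 < p < 1."""
    total = 0.0
    if q > 0:
        total += q * log2(q / p)
    if q < 1:
        total += (1 - q) * log2((1 - q) / (1 - p))
    return total

def inv_h2(v):
    """Solve h2(q) = v for q in [0, 1/2] by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if h2(mid) < v:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def e_sp_bsc(rate, p):
    """Sphere-packing exponent (4.1) for a BSC(p), p < 1/2, with uniform input type."""
    if rate < 0:
        return float("inf")            # empty constraint set
    if rate >= 1 - h2(p):
        return 0.0                     # at or above I(p_X, p_{Y|X})
    return kl2(inv_h2(1 - rate), p)

def e_sp_class(rate, crossovers):
    """Worst-case exponent (4.2) over a finite grid of BSC crossover probabilities."""
    return min(e_sp_bsc(rate, p) for p in crossovers)

# Hypothetical grid standing in for the family of BSCs with crossover probability <= 0.1.
crossovers = [k / 100 for k in range(1, 11)]
worst = e_sp_class(0.1, crossovers)
```

In this example the minimum in (4.2) is attained by the noisiest channel of the family at every rate, which is the situation in which $E_{sp}(R, p_X, \mathscr{W})$ is convex in $R$.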

The modified random coding exponent for channel $p_{Y|X}$ and for class $\mathscr{W}$ are respectively defined
as

$$ E_{r,F}(R, p_X, p_{Y|X}) \triangleq \min_{\tilde{p}_{Y|X}} [D(\tilde{p}_{Y|X} \| p_{Y|X} | p_X) + F(I(p_X, \tilde{p}_{Y|X}) - R)] \tag{4.3} $$

and

$$ E_{r,F}(R, p_X, \mathscr{W}) \triangleq \min_{p_{Y|X} \in \mathscr{W}} E_{r,F}(R, p_X, p_{Y|X}). \tag{4.4} $$

When $F(t) = |t|^+$, (4.3) is just the usual random coding exponent. It can be verified that (4.3) is
a continuous functional of $F$.

Define the function

$$ F_{R,p_X,\mathscr{W}}(t) \triangleq E_{sp}(R, p_X, \mathscr{W}) - E_{sp}(R + t, p_X, \mathscr{W}) \tag{4.5} $$

which is depicted in Fig. 1 for a BSC example to be analyzed in Sec. 7. This function is increasing
for $R_\infty(p_X, \mathscr{W}) - R \le t \le I(p_X, \mathscr{W}) - R$ and satisfies the following properties:

$$ F_{R,p_X,\mathscr{W}}(0) = 0, $$
$$ F'_{R,p_X,\mathscr{W}}(t) = -E'_{sp}(R + t, p_X, \mathscr{W}), $$
$$ E_{sp}(R', p_X, \mathscr{W}) + F_{R,p_X,\mathscr{W}}(R' - R) = E_{sp}(R, p_X, \mathscr{W}). \tag{4.6} $$

If $E_{sp}(R, p_X, \mathscr{W})$ is convex in $R$, then $F_{R,p_X,\mathscr{W}}(t)$ is concave in $t$.

Proposition 4.1 The modified random coding exponent $E_{r,F}(R, p_X, p_{Y|X})$ satisfies the following
properties.

(i) $E_{r,F}(R, p_X, p_{Y|X})$ is nonincreasing in $R$.

(ii) If $F \preceq G$, then $E_{r,F}(R, p_X, p_{Y|X}) \le E_{r,G}(R, p_X, p_{Y|X})$.

(iii) $E_{r,F}(R, p_X, p_{Y|X})$ is related to the sphere packing exponent as follows:

$$ E_{r,F}(R, p_X, p_{Y|X}) = \min_{R'} [E_{sp}(R', p_X, p_{Y|X}) + F(R' - R)]. \tag{4.7} $$

(iv) The above properties hold with $\mathscr{W}$ in place of $p_{Y|X}$ in the arguments of the functions $E_{r,F}$ and
$E_{sp}$.
Figure 1: Function $F_{R,p_X,\mathscr{W}}(R' - R)$ versus rate $R'$ when $R = 0.1$, $\mathscr{W}$ is the family of BSC's with crossover probability
$p \le 0.1$ (capacity $C(p) \ge 0.53$), and $p_X$ is the uniform p.m.f. over $\{0, 1\}$.

The proof of these properties is given in the appendix. Part (iii) is a variation on Lemma 5.4
and its corollary in [1, p. 168]. Also note that while $E_{r,F}(R, p_X, p_{Y|X})$ is convex in $R$ for some
choices of $F$, including (3.2), that property does not extend to arbitrary $F$.

Proposition 4.2 The incorrect-message and erasure exponents under the decision rule (3.1) are
respectively given by

$$ E_i(R, p_X, p_{Y|X}, F) = E_{r,F}(R, p_X, p_{Y|X}), \tag{4.8} $$
$$ E_\emptyset(R, p_X, p_{Y|X}, F) = E_{r,|F^{-1}|^+}(R, p_X, p_{Y|X}). \tag{4.9} $$

Proof: see appendix.

If $F(t) = |t|^+$, then $|F^{-1}(t)|^+ = |t|^+$, and both (4.8) and (4.9) reduce to the ordinary random-coding
exponent.

If the channel is not reliable enough in the sense that $I(p_X, p_{Y|X}) \le R + F(0)$, then
$F^{-1}(I(p_X, p_{Y|X}) - R) \le F^{-1}(F(0)) = 0$, and from (4.9) we obtain $E_\emptyset(R, p_X, p_{Y|X}, F) = 0$
because the minimizing $\tilde{p}_{Y|X}$ in the expression for $E_{r,|F^{-1}|^+}$ is equal to $p_{Y|X}$. Checking (3.1), a
heuristic explanation for the zero erasure exponent is that $I(\mathbf{x}(m); \mathbf{y}) \approx I(p_X, p_{Y|X})$ with high
probability when $m$ is the transmitted message, and $\max_{i \ne m} I(\mathbf{x}(i); \mathbf{y}) \approx R$ (obtained using (B.4)
with $u = R$ and the union bound).
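The identity (4.7) reduces the evaluation of $E_{r,F}$ to a one-dimensional minimization over $R'$, which is convenient numerically. The sketch below evaluates it on a grid for a single BSC with uniform input type, using the closed-form BSC sphere-packing exponent as an assumption; all function names are hypothetical. Two sanity checks are included: with $F(t) = |t|^+$ and a rate below the critical rate, the result should match the classical straight-line form $R_0 - R$ with $R_0 = 1 - \log_2(1 + 2\sqrt{p(1-p)})$, and the erasure exponent (4.9) should vanish when $I(p_X, p_{Y|X}) \le R + F(0)$.

```python
from math import log2

def h2(q):
    """Binary entropy in bits."""
    return 0.0 if q <= 0.0 or q >= 1.0 else -q * log2(q) - (1 - q) * log2(1 - q)

def kl2(q, p):
    """Binary divergence D(q||p) in bits, for 0 < q < 1 and 0 < p < 1."""
    return q * log2(q / p) + (1 - q) * log2((1 - q) / (1 - p))

def inv_h2(v):
    """Solve h2(q) = v for q in [0, 1/2] by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if h2(mid) < v else (lo, mid)
    return (lo + hi) / 2

def e_sp_bsc(rate, p):
    """Sphere-packing exponent of a BSC(p), uniform input type (assumed closed form)."""
    return 0.0 if rate >= 1 - h2(p) else kl2(inv_h2(1 - rate), p)

def e_r_F(rate, p, F, step=1e-3):
    """Grid evaluation of (4.7): min over R' in [0, 1] of E_sp(R') + F(R' - rate)."""
    grid = (k * step for k in range(int(round(1 / step)) + 1))
    return min(e_sp_bsc(r, p) + F(r - rate) for r in grid)

p = 0.1
# Check 1: ordinary random-coding exponent, F(t) = |t|^+, rate below the critical rate.
e_random_coding = e_r_F(0.05, p, lambda t: max(t, 0.0))
cutoff_rate = 1 - log2(1 + 2 * (p * (1 - p)) ** 0.5)

# Check 2: F(t) = delta + |t|^+ gives |F^{-1}(u)|^+ = |u - delta|^+.
# Here I(p_X, p_{Y|X}) is about 0.53 <= R + F(0) = 0.6, so (4.9) should be zero.
delta = 0.5
e_erasure = e_r_F(0.1, p, lambda u: max(u - delta, 0.0))
```

A grid search is crude but adequate here because the objective is continuous in $R'$; a finer `step` tightens the approximation.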
5 $\mathscr{W}$-Optimal Choice of F



In this section, we view the incorrect-message and erasure exponents as functionals of $|F|^+$ and
examine optimal tradeoffs between them. It is instructive to first consider the one-parameter family

$$ F(t) \equiv \Delta \ge 0, \tag{5.1} $$

which corresponds to a thresholding rule in (3.1). Using (4.7) it is easily verified that

$$ E_i(R, p_X, p_{Y|X}, F) = E_{r,F}(R, p_X, p_{Y|X}) = \Delta, \tag{5.2} $$

$$ E_\emptyset(R, p_X, p_{Y|X}, F) = E_{sp}(R + \Delta, p_X, p_{Y|X}). \tag{5.3} $$

Two extreme choices for $\Delta$ are $0$ and $I(p_X, p_{Y|X}) - R$ because in each case one error exponent is
zero and the other one is positive. One would expect that better tradeoffs can be achieved using a
broader class of functions $F$, though.
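The verification of (5.2) and (5.3) is short enough to spell out. The sketch below assumes that $E_{sp}(R', p_X, p_{Y|X})$ is nonincreasing in $R'$ and vanishes for $R' \ge I(p_X, p_{Y|X})$, and it treats the constant function as the limit of the strictly increasing functions $F_\epsilon(t) = \Delta + \epsilon t$ as $\epsilon \downarrow 0$. This limiting device is an assumption made here for illustration and is not taken from the proofs in the appendix.

```latex
\begin{align*}
E_{r,F}(R, p_X, p_{Y|X})
  &= \min_{R'} \bigl[ E_{sp}(R', p_X, p_{Y|X}) + \Delta \bigr]
   = \Delta + \min_{R'} E_{sp}(R', p_X, p_{Y|X}) = \Delta, \\
|F_\epsilon^{-1}(u)|^+
  &= \frac{|u - \Delta|^+}{\epsilon}
   \;\longrightarrow\;
   \begin{cases} 0, & u \le \Delta \\ +\infty, & u > \Delta \end{cases}
   \qquad (\epsilon \downarrow 0), \\
E_{r,|F^{-1}|^+}(R, p_X, p_{Y|X})
  &= \min_{R' \,:\, R' - R \le \Delta} E_{sp}(R', p_X, p_{Y|X})
   = E_{sp}(R + \Delta, p_X, p_{Y|X}).
\end{align*}
```

The first line is (5.2) via (4.7) and (4.8); the last line is (5.3) via (4.9), where the final equality uses the monotonicity of $E_{sp}$ in its first argument.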

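Recalling (3.9), (3.10) and using (4.8) and (4.9), we seek the solution to the following two
asymptotic Neyman-Pearson optimization problems. For list decoding, find

$$ E^L_\emptyset(R, \mathscr{W}, \alpha) = \max_{p_X} E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) \tag{5.4} $$

with the cost function

$$ E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) = \sup_{F \in \mathcal{F}^L(R, p_X, \mathscr{W}, \alpha)} E_{r,|F^{-1}|^+}(R, p_X, \mathscr{W}), \tag{5.5} $$

where the feasible set $\mathcal{F}^L(R, p_X, \mathscr{W}, \alpha)$ is the set of continuous, increasing functions $F$ that yield
an incorrect-message exponent at least equal to $\alpha$:

$$ E_i(R, p_X, \mathscr{W}, F) = E_{r,F}(R, p_X, \mathscr{W}) \ge \alpha. \tag{5.6} $$

For classical decoding (list size $\le 1$), find

$$ E_\emptyset(R, \mathscr{W}, \alpha) = \max_{p_X} E_\emptyset(R, p_X, \mathscr{W}, \alpha) \tag{5.7} $$

where

$$ E_\emptyset(R, p_X, \mathscr{W}, \alpha) = \sup_{F \in \mathcal{F}(R, p_X, \mathscr{W}, \alpha)} E_{r,|F^{-1}|^+}(R, p_X, \mathscr{W}) \tag{5.8} $$

and $\mathcal{F}(R, p_X, \mathscr{W}, \alpha)$ is the subset of functions in $\mathcal{F}^L(R, p_X, \mathscr{W}, \alpha)$ that satisfy $F(t) \ge t$, i.e., the
decoder outputs at most one message.

Since $\mathcal{F}(R, p_X, \mathscr{W}, \alpha) \subseteq \mathcal{F}^L(R, p_X, \mathscr{W}, \alpha)$, we have

$$ E_\emptyset(R, p_X, \mathscr{W}, \alpha) \le E^L_\emptyset(R, p_X, \mathscr{W}, \alpha). $$

The sequel of this paper focuses on the class of variable-size list decoders associated with $\mathcal{F}^L$ because
the error exponent tradeoffs are at least as good as those associated with $\mathcal{F}$, and the corresponding
error exponents take a more concise form.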
Define the critical rate $R^{cr}(p_X, \mathscr{W})$ as the rate at which the derivative $E'_{sp}(\cdot, p_X, \mathscr{W}) = -1$, and

$$ \Delta \triangleq \alpha - E_{sp}(R, p_X, \mathscr{W}). \tag{5.9} $$

Two rates $R_1$ and $R_2$ are said to be conjugate given $p_X$ and $\mathscr{W}$ if the corresponding slopes of
$E_{sp}(\cdot, p_X, \mathscr{W})$ are inverse of each other:

$$ E'_{sp}(R_1, p_X, \mathscr{W}) = \frac{1}{E'_{sp}(R_2, p_X, \mathscr{W})}. \tag{5.10} $$

The difference $d$ between two conjugate rates uniquely specifies them. We denote the smaller one by
$R_1(d)$ and the larger one by $R_2(d)$, irrespective of the sign of $d$. Hence $R_1(d) \le R^{cr}(p_X, \mathscr{W}) \le R_2(d)$,
with equality when $d = 0$. We also denote by $R^{conj}(p_X, \mathscr{W})$ the conjugate rate of $R$, as defined
by (5.10). The conjugate rate always exists when $R$ is below the critical rate $R^{cr}(p_X, \mathscr{W})$. If $R$ is
above the critical rate and sufficiently large, $R^{conj}(p_X, \mathscr{W})$ may not exist. Instead of treating this
case separately, we note that this case will be irrelevant because the conjugate rate always appears
via the expression $\max\{R, R^{conj}(p_X, \mathscr{W})\}$ which is equal to $R$ if $R \ge R^{cr}(p_X, \mathscr{W})$, and therefore this
expression is always well defined.

The proofs of Prop. 5.1 and 5.3 and Lemma 5.4 below may be found in the appendix; recall that
$F_{R,p_X,\mathscr{W}}(t)$ was defined in (4.5). The proof of Prop. 5.3 parallels that of Prop.EZQn) and is therefore
omitted.



Proposition 5.1 The suprema in (5.5) and (5.8) are respectively achieved by

$$ F^{L*}(t) = F_{R,p_X,\mathscr{W}}(t) + \Delta \tag{5.11} $$

$$ \phantom{F^{L*}(t)} = \alpha - E_{sp}(R + t, p_X, \mathscr{W}) \tag{5.12} $$

and

$$ F^*(t) = \max(t, F^{L*}(t)). \tag{5.13} $$

The resulting incorrect-message exponent is given by $E_i(R, p_X, \mathscr{W}) = \alpha$. The optimal solution is
nonunique. In particular, for $t < 0$, one can replace $F^{L*}(t)$ by the constant $F^{L*}(0)$ without effect
on the error exponents.



The proof of Prop. 5.1 is quite simple and can be separated from the calculation of the error
exponents. The main idea is as follows. If $G \preceq F$, we have $E_{r,G}(R, p_X, \mathscr{W}) \le E_{r,F}(R, p_X, \mathscr{W})$.
Since $G^{-1} \succeq F^{-1}$, we also have $E_{r,|G^{-1}|^+}(R, p_X, \mathscr{W}) \ge E_{r,|F^{-1}|^+}(R, p_X, \mathscr{W})$. Therefore we seek $F^* \in
\mathcal{F}^L(R, p_X, \mathscr{W}, \alpha)$ such that $F^* \preceq F$ for all $F \in \mathcal{F}^L(R, p_X, \mathscr{W}, \alpha)$. Such $F^*$, assuming it exists,
necessarily achieves $E^L_\emptyset(R, p_X, \mathscr{W}, \alpha)$. The same procedure applies to $\mathcal{F}(R, p_X, \mathscr{W}, \alpha)$.

Corollary 5.2 If $R \ge I(p_X, \mathscr{W})$, the thresholding rule $F^{L*}(t) \equiv \Delta$ of (5.1) is optimal, and the
optimal error exponents are $E_i(R, p_X, \mathscr{W}, \alpha) = \Delta$ and $E_\emptyset(R, p_X, \mathscr{W}, \alpha) = 0$.

Proof. Since $R \ge I(p_X, \mathscr{W})$, we have $E_{sp}(R, p_X, \mathscr{W}) = 0$. Hence from (4.5), $F_{R,p_X,\mathscr{W}}(t) \equiv 0$
for all $t \ge 0$. Substituting into (5.11) proves the optimality of the thresholding rule (5.1). The
corresponding error exponents are obtained by minimizing (5.2) and (5.3) over $p_{Y|X} \in \mathscr{W}$. □

The case $R < I(p_X, \mathscr{W})$ is addressed next.
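Proposition 5.3 If $E_{sp}(R, p_X, \mathscr{W})$ is convex in $R$, the optimal incorrect-message and erasure
exponents are related as follows.

(i) For $|R^{conj}(p_X, \mathscr{W}) - R|^+ \le \Delta \le I(p_X, \mathscr{W}) - R$, we have

$$ E^L_i(R, p_X, \mathscr{W}, \alpha) = E_{sp}(R, p_X, \mathscr{W}) + \Delta = \alpha, $$
$$ E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) = E_{sp}(R + \Delta, p_X, \mathscr{W}). \tag{5.14} $$

(ii) The above exponents are also achieved using the penalty function $F(t) = \Delta + \lambda |t|^+$ with

$$ -E'_{sp}(R, p_X, \mathscr{W}) \le \lambda \le -\frac{1}{E'_{sp}(R + \Delta, p_X, \mathscr{W})}. \tag{5.15} $$

(iii) If $R \le R^{cr}(p_X, \mathscr{W})$ and $0 \le \Delta \le R^{conj}(p_X, \mathscr{W}) - R$, we have

$$ E^L_i(R, p_X, \mathscr{W}, \alpha) = E_{sp}(R, p_X, \mathscr{W}) + \Delta = \alpha, $$
$$ E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) = E_{sp}(R_2(\Delta), p_X, \mathscr{W}) + F^{-1}_{R,p_X,\mathscr{W}}(R_1(\Delta) - R). \tag{5.16} $$

(iv) If $R \le R^{cr}(p_X, \mathscr{W})$ and $R_\infty(p_X, \mathscr{W}) - R \le \Delta \le 0$, we have

$$ E^L_i(R, p_X, \mathscr{W}, \alpha) = E_{sp}(R, p_X, \mathscr{W}) + \Delta = \alpha, $$
$$ E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) = E_{sp}(R_1(\Delta), p_X, \mathscr{W}) + F^{-1}_{R,p_X,\mathscr{W}}(R_2(\Delta) - R). \tag{5.17} $$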



Part (ii) of the proposition implies that not only is the optimal $F$ nonunique under the
combinations of $(R, p_X, \mathscr{W}, \alpha)$ of Part (i), but also the Csiszar-Korner rule (2.9) is optimal for any $(\Delta, \lambda)$
in a certain range of values.
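Parts (i) and (ii) can be checked numerically in a simple case. The sketch below is an illustration under stated assumptions: $\mathscr{W}$ is taken to be a single BSC with crossover probability 0.1, the input type is uniform, the closed-form BSC sphere-packing exponent is assumed, and $R = 0.25$ is chosen above the critical rate so that $|R^{conj} - R|^+ = 0$. The function names are hypothetical. For $F(t) = \Delta + \lambda|t|^+$ we have $|F^{-1}(u)|^+ = |u - \Delta|^+/\lambda$, so both exponents follow from (4.7) by a grid search over $R'$.

```python
from math import log2

def h2(q):
    """Binary entropy in bits."""
    return 0.0 if q <= 0.0 or q >= 1.0 else -q * log2(q) - (1 - q) * log2(1 - q)

def kl2(q, p):
    """Binary divergence D(q||p) in bits, for 0 < q < 1 and 0 < p < 1."""
    return q * log2(q / p) + (1 - q) * log2((1 - q) / (1 - p))

def inv_h2(v):
    """Solve h2(q) = v for q in [0, 1/2] by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if h2(mid) < v else (lo, mid)
    return (lo + hi) / 2

def e_sp_bsc(rate, p):
    """Sphere-packing exponent of a BSC(p), uniform input type (assumed closed form)."""
    return 0.0 if rate >= 1 - h2(p) else kl2(inv_h2(1 - rate), p)

def grid_min(cost, step=1e-3):
    """Minimum of cost(R') over the grid R' = 0, step, ..., 1."""
    return min(cost(k * step) for k in range(int(round(1 / step)) + 1))

def ck_exponents(rate, p, delta, lam):
    """(E_i, E_erasure) for F(t) = delta + lam * |t|^+, via (4.7), (4.8) and (4.9)."""
    e_i = grid_min(lambda r: e_sp_bsc(r, p) + delta + lam * max(r - rate, 0.0))
    e_er = grid_min(lambda r: e_sp_bsc(r, p) + max(r - rate - delta, 0.0) / lam)
    return e_i, e_er

p, rate, delta = 0.1, 0.25, 0.1
# lam = 1 lies inside the range (5.15) for these values (roughly 0.69 to 2.7).
e_i, e_er = ck_exponents(rate, p, delta, lam=1.0)
# lam = 0.3 lies below the range, so the incorrect-message exponent falls short of alpha.
e_i_low, _ = ck_exponents(rate, p, delta, lam=0.3)
alpha = e_sp_bsc(rate, p) + delta
```

Inside the range (5.15) the minima in (4.7) are attained at $R' = R$ and $R' = R + \Delta$, which is why the exponents reduce to the simple form (5.14).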

Also, while Prop. 5.3 provides simple expressions for the worst-case error exponents over $\mathscr{W}$,
the exponents for any specific channel $p_{Y|X} \in \mathscr{W}$ are obtained by substituting the function (5.12)
and its inverse, respectively, into the minimization problem of (4.7). This problem does not generally
admit a simple expression.

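This leads us back to the question asked at the end of Sec. 3.2, namely, when does the decoder
pay no penalty for not knowing $p_{Y|X}$? Defining

$$ \underline{E}'_{sp}(R, p_X, \mathscr{W}) \triangleq \min_{p_{Y|X} \in \mathscr{W}} E'_{sp}(R, p_X, p_{Y|X}), \tag{5.18} $$

and

$$ \overline{R}^{conj}(p_X, \mathscr{W}) \triangleq \max_{p_{Y|X} \in \mathscr{W}} R^{conj}(p_X, p_{Y|X}), \tag{5.19} $$

we have the following lemma, whose proof appears in the appendix.

Lemma 5.4

$$ E'_{sp}(R, p_X, \mathscr{W}) \ge \underline{E}'_{sp}(R, p_X, \mathscr{W}), \tag{5.20} $$

$$ \overline{R}^{conj}(p_X, \mathscr{W}) \ge R^{conj}(p_X, \mathscr{W}), \tag{5.21} $$

with equality if the same $p_{Y|X}$ minimizes $E_{sp}(R, p_X, p_{Y|X})$ at all rates.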
Proposition 5.5 Assume that $R$, $p_X$, $\mathscr{W}$, $\Delta$ and $\lambda$ are such that

$$ |\overline{R}^{conj}(p_X, \mathscr{W}) - R|^+ \le \Delta \le I(p_X, \mathscr{W}) - R, \tag{5.22} $$

$$ -\underline{E}'_{sp}(R, p_X, \mathscr{W}) \le \lambda \le -\frac{1}{\underline{E}'_{sp}(R + \Delta, p_X, \mathscr{W})}. \tag{5.23} $$

Then the pair of incorrect-message and erasure exponents

$$ \{ E_{sp}(R, p_X, p_{Y|X}) + \Delta, \; E_{sp}(R + \Delta, p_X, p_{Y|X}) \} \tag{5.24} $$

is universally attainable over $p_{Y|X} \in \mathscr{W}$ using the penalty function $F(t) = \Delta + \lambda |t|^+$, and equality
holds in the erasure-exponent game of (3.11).

Proof. From (5.19) and (5.22), we have

$$ |R^{conj}(p_X, p_{Y|X}) - R|^+ \le \Delta \le I(p_X, p_{Y|X}) - R, \quad \forall p_{Y|X} \in \mathscr{W}. $$

Similarly, from (5.18) and (5.23), we have

$$ -E'_{sp}(R, p_X, p_{Y|X}) \le \lambda \le -\frac{1}{E'_{sp}(R + \Delta, p_X, p_{Y|X})}, \quad \forall p_{Y|X} \in \mathscr{W}. $$

Then applying Prop. 5.3(ii) with the singleton $\{p_{Y|X}\}$ in place of $\mathscr{W}$ proves the claim. □

The set of $(\Delta, \lambda)$ defined by (5.22), (5.23) is smaller than that of Prop. 5.3(i) but is not empty
because $\underline{E}'_{sp}(R + \Delta, p_X, \mathscr{W})$ tends to zero as $\Delta$ approaches the upper limit $I(p_X, \mathscr{W}) - R$. Thus the
universal exponents in (5.24) hold at least in the small erasure-exponent regime (where $E_{sp}(R +
\Delta, p_X, p_{Y|X}) \to 0$) and coincide with those derived by Forney [3, Theorem 3(a)] for symmetric
channels, using MAP decoding. For symmetric channels, the same input distribution $p_X$ is optimal
at all rates.¹ Our rates are identical to his, i.e., the same optimal error exponents are achieved
without knowledge of the channel.

6 Relative Minimax 

When the compound class $\mathscr{W}$ is so large that $I(p_X, \mathscr{W}) \le R$, we have seen from Corollary 5.2 that
the simple thresholding rule $F(t) \equiv \Delta$ is optimal. Even if $I(p_X, \mathscr{W}) > R$, our minimax criterion
(which seeks the worst-case error exponents over the class $\mathscr{W}$) for designing $F$ might be a pessimistic
one. This drawback can be alleviated to some extent using a relative minimax principle, see [10]
and references therein. Our proposed approach is to define two functionals $\alpha(p_{Y|X})$ and $\beta(p_{Y|X})$
and the relative error exponents

$$ \Delta_\alpha E_i(R, p_X, p_{Y|X}, F) \triangleq E_i(R, p_X, p_{Y|X}, F) - \alpha(p_{Y|X}), $$
$$ \Delta_\beta E_\emptyset(R, p_X, p_{Y|X}, F) \triangleq E_\emptyset(R, p_X, p_{Y|X}, F) - \beta(p_{Y|X}). $$

Then solve the constrained optimization problem of (3.9) with the above functionals in place of
$E_i(R, p_X, p_{Y|X}, F) - \alpha$ and $E_\emptyset(R, p_X, p_{Y|X}, F)$. It is reasonable to choose $\alpha(p_{Y|X})$ and $\beta(p_{Y|X})$



¹ Forney also studied the case $E_\emptyset(R) > E_i(R)$, which is not covered by our analysis.
large for "good channels" and small for very noisy channels. While $\alpha(p_{Y|X})$ and $\beta(p_{Y|X})$ could
be the error exponents associated with some reference test, this is not a requirement. A possible
choice is

$$ \alpha(p_{Y|X}) = \Delta, $$
$$ \beta(p_{Y|X}) = E_{sp}(R + \Delta, p_X, p_{Y|X}), $$

which are the error exponents (5.2) and (5.3) corresponding to the thresholding rule $F(t) \equiv \Delta$.
Another choice is

$$ \alpha(p_{Y|X}) = E_{sp}(R, p_X, p_{Y|X}) + \Delta, \tag{6.1} $$
$$ \beta(p_{Y|X}) = E_{sp}(R + \Delta, p_X, p_{Y|X}), \tag{6.2} $$

which are the "ideal" Forney exponents, achievable under the assumptions of Prop. 5.3(i).

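The relative minimax problem is a simple extension of the minimax problem solved earlier.
Define the following functions:

$$\Delta_\alpha E_{r,F}(R, p_X, \mathscr{W}) \triangleq \min_{p_{Y|X} \in \mathscr{W}} \left[E_{r,F}(R, p_X, p_{Y|X}) - \alpha(p_{Y|X})\right], \tag{6.3}$$

$$\Delta_\alpha E_{sp}(R, p_X, \mathscr{W}) \triangleq \min_{p_{Y|X} \in \mathscr{W}} \left[E_{sp}(R, p_X, p_{Y|X}) - \alpha(p_{Y|X})\right], \tag{6.4}$$

$$F_{R,p_X,\mathscr{W},\alpha}(t) \triangleq \Delta_\alpha E_{sp}(R, p_X, \mathscr{W}) - \Delta_\alpha E_{sp}(R + t, p_X, \mathscr{W}). \tag{6.5}$$

The function $F_{R,p_X,\mathscr{W},\alpha}(t)$ of (6.5) is increasing and satisfies $F_{R,p_X,\mathscr{W},\alpha}(0) = 0$. The above functions
$\Delta_\alpha E_{r,F}$ and $\Delta_\alpha E_{sp}$ satisfy the following relationship:

$$\begin{aligned}
\Delta_\alpha E_{r,F}(R, p_X, \mathscr{W}) &\overset{(a)}{=} \min_{p_{Y|X} \in \mathscr{W}} \left\{ \min_{R'} \left[E_{sp}(R', p_X, p_{Y|X}) + F(R' - R)\right] - \alpha(p_{Y|X}) \right\} \\
&= \min_{R'} \left\{ \min_{p_{Y|X} \in \mathscr{W}} \left[E_{sp}(R', p_X, p_{Y|X}) - \alpha(p_{Y|X})\right] + F(R' - R) \right\} \\
&\overset{(b)}{=} \min_{R'} \left[\Delta_\alpha E_{sp}(R', p_X, \mathscr{W}) + F(R' - R)\right]
\end{aligned} \tag{6.6}$$

where (a) is obtained from (4.7) and (6.3), and (b) from (6.4). Equation (6.6) is of the same form
as (4.7), with $\Delta_\alpha E_{r,F}$ and $\Delta_\alpha E_{sp}$ in place of $E_{r,F} - \alpha$ and $E_{sp} - \alpha$, respectively.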

Analogously to (5.4), (5.5), and (5.6), the relative minimax for variable-size decoders is given by

$$\Delta_\beta E^L_\emptyset(R, \mathscr{W}, \alpha) = \max_{p_X} \sup_{F \in \mathscr{F}^L(R, p_X, \mathscr{W}, \alpha)} \Delta_\beta E_{r,|F^{-1}|^+}(R, p_X, \mathscr{W}) \tag{6.7}$$

where the feasible set $\mathscr{F}^L(R, p_X, \mathscr{W}, \alpha)$ is the set of functions $F$ that satisfy

$$\Delta_\alpha E_{r,F}(R, p_X, \mathscr{W}) \ge 0$$

as well as the previous continuity and monotonicity conditions. The following proposition is analogous to Prop. 5.1.
Proposition 6.1 The supremum over $F$ in (6.7) is achieved by

$$\begin{aligned}
F^{L*}(t) &= F_{R,p_X,\mathscr{W},\alpha}(t) - \Delta_\alpha E_{sp}(R, p_X, \mathscr{W}) \\
&= -\Delta_\alpha E_{sp}(R + t, p_X, \mathscr{W}),
\end{aligned} \tag{6.8}$$

independently of the choice of $\beta$. The relative minimax is given by

$$\Delta_\beta E^L_\emptyset(R, \mathscr{W}, \alpha) = \max_{p_X} \Delta_\beta E_{r,|(F^{L*})^{-1}|^+}(R, p_X, \mathscr{W}).$$

Proof. The proof exploits the same monotonicity property ($E_{r,F} \le E_{r,G}$ for $F \preceq G$) that was used
to derive the optimal $F$ in (5.12). The supremum over $F$ is obtained by following the steps of the
proof of Prop. 5.1, substituting $\Delta_\alpha E_{r,F}$, $\Delta_\alpha E_{sp}$, and $F_{R,p_X,\mathscr{W},\alpha}$ for $E_{r,F} - \alpha$, $E_{sp} - \alpha$, and $F_{R,p_X,\mathscr{W}}$,
respectively. The relative minimax is obtained by substituting the optimal $F$ into (6.7). □

We would like to know how much influence the reference function $\alpha$ has on the optimal $F$. For
the "Forney reference exponent function" $\alpha$ of (6.1), we obtain the optimal $F$ from (6.8) and (6.4):

$$\begin{aligned}
F^{L*}(t) &= -\min_{p_{Y|X} \in \mathscr{W}} \left[E_{sp}(R + t, p_X, p_{Y|X}) - \alpha(p_{Y|X})\right] \\
&= -\min_{p_{Y|X} \in \mathscr{W}} \left[E_{sp}(R + t, p_X, p_{Y|X}) - E_{sp}(R, p_X, p_{Y|X}) - \Delta\right] \\
&= \Delta + \max_{p_{Y|X} \in \mathscr{W}} \left[E_{sp}(R, p_X, p_{Y|X}) - E_{sp}(R + t, p_X, p_{Y|X})\right] \\
&= \Delta + \max_{p_{Y|X} \in \mathscr{W}} F_{R,p_X,p_{Y|X}}(t).
\end{aligned} \tag{6.9}$$

Interestingly, the maximum above is often achieved by the cleanest channel in $\mathscr{W}$, for which
$E_{sp}(R, p_X, p_{Y|X})$ is large and $E_{sp}(R + t, p_X, p_{Y|X})$ falls off rapidly as $t$ increases. This stands in
contrast to (5.12), which may be written as

$$\begin{aligned}
F^{L*}(t) &= \alpha - \min_{p_{Y|X} \in \mathscr{W}} E_{sp}(R + t, p_X, p_{Y|X}) \\
&= \Delta + \min_{p_{Y|X} \in \mathscr{W}} E_{sp}(R, p_X, p_{Y|X}) - \min_{p_{Y|X} \in \mathscr{W}} E_{sp}(R + t, p_X, p_{Y|X}).
\end{aligned} \tag{6.10}$$

In (6.10), the minima are achieved by the noisiest channel at rates $R$ and $R + t$, respectively. Also
note that $F^{L*}(t)$ from (6.9) is uniformly larger than $F^{L*}(t)$ from (6.10) and thus results in larger
incorrect-message exponents.

For $R \ge I(p_X, \mathscr{W})$, Corollary 5.2 has shown that the minimax criterion is maximized by the
thresholding rule $F(t) = \Delta$, which yields $E_i(R, p_X, p_{Y|X}) = \Delta$ and $E_\emptyset(R, p_X, p_{Y|X}) = E_{sp}(R + \Delta, p_X, p_{Y|X})$
for all $p_{Y|X}$. The relative minimax criterion based on $\alpha(p_{Y|X})$ of (6.1) yields a higher
$E_i(R, p_X, p_{Y|X})$ for good channels and this is counterbalanced by a lower $E_\emptyset(R, p_X, p_{Y|X})$. Thus
the primary advantage of the relative minimax approach is that $\alpha(p_{Y|X})$ can be chosen to more
finely balance the error exponents across the range of channels of interest.

7 Compound Binary Symmetric Channel 

We have evaluated the incorrect-message and erasure exponents of (5.24) for the compound BSC
with crossover probability $\rho \in [\rho_{\min}, \rho_{\max}]$, where $0 < \rho_{\min} < \rho_{\max} < \frac{1}{2}$. The class $\mathscr{W}$ may
be identified with the interval $[\rho_{\min}, \rho_{\max}]$, where $\rho_{\min}$ and $\rho_{\max}$ correspond to the cleanest and
noisiest channels in $\mathscr{W}$, respectively. Denote by $h_2(\rho) \triangleq -\rho \log \rho - (1 - \rho) \log(1 - \rho)$ the binary
entropy function, by $h_2^{-1}(\cdot)$ the inverse of that function over the range $[0, \frac{1}{2}]$, and by $p_\rho$ the Bernoulli
p.m.f. with parameter $\rho$.

Capacity of the BSC is given by $C(\rho) = 1 - h_2(\rho)$, and the sphere-packing exponent by [1, p. 195]

$$\begin{aligned}
E_{sp}(R, \rho) &= D(p_{\rho_R} \| p_\rho) \\
&= \rho_R \log \frac{\rho_R}{\rho} + (1 - \rho_R) \log \frac{1 - \rho_R}{1 - \rho}, \qquad 0 \le R \le C(\rho),
\end{aligned} \tag{7.1}$$

where $\rho_R = h_2^{-1}(1 - R) \ge \rho$. The optimal input distribution $p_X$ is uniform at all rates and will be
omitted from the list of arguments of the functions $E_{r,F}$, $E_{sp}$, and $F_R$ below. The critical rate is

$$R_{cr}(\rho) = 1 - h_2\left(\frac{1}{1 + \sqrt{\rho^{-1} - 1}}\right)$$

and $E_{sp}(0, \rho) = -\log \sqrt{4\rho(1 - \rho)}$.
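The BSC quantities above can be checked numerically. The following is a minimal sketch, assuming logarithms to base 2 (exponents in bits) and a bisection-based inverse of $h_2$; the function names (`h2`, `h2_inv`, `capacity`, `e_sp`, `critical_rate`) are illustrative choices and do not come from the paper.

```python
import math

def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def h2_inv(h, tol=1e-12):
    """Inverse of h2 on [0, 1/2], by bisection (h2 is increasing there)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h2(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def capacity(rho):
    """C(rho) = 1 - h2(rho) for a BSC with crossover probability rho."""
    return 1.0 - h2(rho)

def e_sp(R, rho):
    """Sphere-packing exponent (7.1): D(p_{rho_R} || p_rho); zero for R >= C(rho)."""
    if R >= capacity(rho):
        return 0.0
    a = h2_inv(1.0 - R)  # rho_R
    return a * math.log2(a / rho) + (1.0 - a) * math.log2((1.0 - a) / (1.0 - rho))

def critical_rate(rho):
    """R_cr(rho) = 1 - h2(1 / (1 + sqrt(1/rho - 1)))."""
    return 1.0 - h2(1.0 / (1.0 + math.sqrt(1.0 / rho - 1.0)))

if __name__ == "__main__":
    rho = 0.1
    print("C(rho)      =", capacity(rho))
    print("R_cr(rho)   =", critical_rate(rho))
    print("E_sp(0,rho) =", e_sp(0.0, rho))
```

At the critical rate the slope of $E_{sp}(\cdot, \rho)$ equals $-1$, which gives a convenient sanity check on both `e_sp` and `critical_rate`.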

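The capacity and sphere-packing exponent for the compound BSC are respectively given by
$C(\mathscr{W}) = C(\rho_{\max})$ and

$$E_{sp}(R, \mathscr{W}) = \min_{\rho_{\min} \le \rho \le \rho_{\max}} E_{sp}(R, \rho) = E_{sp}(R, \rho_{\max}). \tag{7.2}$$

For $R \ge C(\mathscr{W})$, the optimal $F$ is the thresholding rule $F(t) = \Delta$ of Corollary 5.2, and (5.2) and (5.3) yield

$$E_i(R, \rho) = \Delta \quad \text{and} \quad E_\emptyset(R, \rho) = E_{sp}(R + \Delta, \rho).$$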

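In the remainder of this section we consider the case $R < C(\mathscr{W})$, in which case $E_{sp}(R, \mathscr{W}) > 0$.

Optimal F. For any $0 < \Delta < C(\rho) - R$, we have $\rho < \rho_{R+\Delta} < \rho_R < \frac{1}{2}$. From (7.1) we have

$$\begin{aligned}
F_{R,\rho}(t) &= E_{sp}(R, \rho) - E_{sp}(R + t, \rho) \\
&= D(p_{\rho_R} \| p_\rho) - D(p_{\rho_{R+t}} \| p_\rho) \\
&= h_2(\rho_{R+t}) - h_2(\rho_R) + (\rho_R - \rho_{R+t}) \left(\log \frac{1}{\rho} - \log \frac{1}{1 - \rho}\right) \\
&= h_2(\rho_{R+t}) - h_2(\rho_R) + (\rho_R - \rho_{R+t}) \log\left(\frac{1}{\rho} - 1\right), \qquad t \ge 0,
\end{aligned}$$

which is a decreasing function of $\rho$. Here we used $D(p_a \| p_\rho) = -h_2(a) + a \log \frac{1}{\rho} + (1 - a) \log \frac{1}{1 - \rho}$; since $h_2(\rho_R) = 1 - R$, the entropy difference equals $-t$.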

Evaluating the optimal $F$ from (5.12), we have

$$\begin{aligned}
F^{L*}(t) &= \Delta + F_{R,\mathscr{W}}(t) \\
&= \Delta + E_{sp}(R, \mathscr{W}) - E_{sp}(R + t, \mathscr{W}) \\
&= \Delta + E_{sp}(R, \rho_{\max}) - E_{sp}(R + t, \rho_{\max}) \\
&= \Delta + F_{R,\rho_{\max}}(t), \qquad t \ge 0.
\end{aligned}$$

Observe that the optimal $F$ is determined by the noisiest channel ($\rho_{\max}$) and does not depend at
all on $\rho_{\min}$.

This contrasts with the relative minimax criterion with $\alpha(p_{Y|X})$ of (6.1), where evaluation of
the optimal $F$ from (6.9) yields

$$\begin{aligned}
F^{L*}(t) &= \Delta + \max_{\rho_{\min} \le \rho \le \rho_{\max}} F_{R,\rho}(t) \\
&= \Delta + F_{R,\rho_{\min}}(t), \qquad t \ge 0
\end{aligned}$$

which is determined by the cleanest channel ($\rho_{\min}$) and does not depend on $\rho_{\max}$.
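The two weighting functions just derived can be compared numerically. The sketch below assumes base-2 logarithms and approximates the maximization over $[\rho_{\min}, \rho_{\max}]$ by a finite grid; the helper names (`f_single`, `f_minimax`, `f_relative`) and the parameter values in the usage example are illustrative assumptions, not taken from the paper.

```python
import math

def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def h2_inv(h, tol=1e-12):
    """Inverse of h2 on [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h2(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def e_sp(R, rho):
    """BSC sphere-packing exponent in bits; zero at or above capacity."""
    if R >= 1.0 - h2(rho):
        return 0.0
    a = h2_inv(1.0 - R)  # rho_R
    return a * math.log2(a / rho) + (1.0 - a) * math.log2((1.0 - a) / (1.0 - rho))

def f_single(R, rho, t):
    """F_{R,rho}(t) = E_sp(R, rho) - E_sp(R + t, rho)."""
    return e_sp(R, rho) - e_sp(R + t, rho)

def f_minimax(R, delta, t, rho_max):
    """Minimax-optimal F^{L*}(t): determined by the noisiest channel."""
    return delta + f_single(R, rho_max, t)

def f_relative(R, delta, t, rho_grid):
    """Relative-minimax F^{L*}(t) with the Forney reference: max over the class."""
    return delta + max(f_single(R, rho, t) for rho in rho_grid)

if __name__ == "__main__":
    R, delta, t = 0.1, 0.05, 0.2
    grid = [0.05 + 0.01 * k for k in range(11)]  # rho in [0.05, 0.15]
    print("minimax :", f_minimax(R, delta, t, grid[-1]))
    print("relative:", f_relative(R, delta, t, grid))
```

Because $F_{R,\rho}(t)$ is decreasing in $\rho$, the grid maximum is attained at the first grid point ($\rho_{\min}$), so the grid approximation is exact here.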

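Optimal Error Exponents. The derivations below are simplified if instead of working with
the crossover probability $\rho$, we use the following reparameterization:

$$\mu \triangleq \rho^{-1} - 1, \qquad \mu_{\max} \triangleq \rho_{\min}^{-1} - 1, \qquad \mu_{\min} \triangleq \rho_{\max}^{-1} - 1$$

and

$$\mu_R \triangleq \rho_R^{-1} - 1 = \frac{1}{h_2^{-1}(1 - R)} - 1 \quad \Leftrightarrow \quad R = 1 - h_2\left(\frac{1}{1 + \mu_R}\right)$$

where $\mu_R$ increases monotonically from 1 to $\infty$ as $R$ increases from 0 to 1. With this notation, we
have $\rho = \frac{1}{1 + \mu}$ and $\mu_{\max} \ge \mu > \mu_{R+\Delta} > \mu_R > 1$. Also

$$\frac{dR}{d\rho_R} = -\frac{dh_2(\rho_R)}{d\rho_R} = -\frac{\ln \mu_R}{\ln 2},$$

$$\frac{dE_{sp}(R, \rho)}{d\rho_R} = \frac{\ln \mu - \ln \mu_R}{\ln 2},$$

$$-E'_{sp}(R, \rho) = \frac{dE_{sp}(R, \rho)/d\rho_R}{-dR/d\rho_R} = \frac{\log \mu}{\log \mu_R} - 1 = \frac{\log(\mu/\mu_R)}{\log \mu_R}. \tag{7.3}$$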

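From (5.18) and (7.3), we obtain

$$-\underline{E}'_{sp}(R, \mathscr{W}) = -\min_{\rho_{\min} \le \rho \le \rho_{\max}} E'_{sp}(R, \rho) = \frac{\log(\mu_{\max}/\mu_R)}{\log \mu_R}. \tag{7.4}$$

Observe that the minimizing $\rho$ is $\rho_{\min}$, i.e., the cleanest channel in $\mathscr{W}$. In contrast, from (7.2), we
have

$$-E'_{sp}(R, \mathscr{W}) = -E'_{sp}(R, \rho_{\max}) = \frac{\log(\mu_{\min}/\mu_R)}{\log \mu_R} \le -\underline{E}'_{sp}(R, \mathscr{W}) \tag{7.5}$$

which is determined by the noisiest channel ($\rho_{\max}$).

Conditions for universality. Next we evaluate $\overline{R}^{conj}(\mathscr{W})$ from (5.19). For a given $\rho$, the
conjugate rate of $R$ is obtained from (7.3):

$$-E'_{sp}(R, \rho) = \frac{1}{-E'_{sp}(R^{conj}, \rho)}$$

$$\frac{\log(\mu/\mu_R)}{\log \mu_R} = \frac{\log \mu_{R^{conj}}}{\log(\mu/\mu_{R^{conj}})}$$

hence $\mu_{R^{conj}} = \mu/\mu_R$ and

$$R^{conj}(\mu) = 1 - h_2\left(\frac{1}{1 + \mu/\mu_R}\right),$$

$$\overline{R}^{conj}(\mathscr{W}) = \max_{\mu_{\min} \le \mu \le \mu_{\max}} R^{conj}(\mu) = R^{conj}(\mu_{\max}).$$

From (7.2), the conjugate rate associated with the compound exponent $E_{sp}(R, \mathscr{W}) = E_{sp}(R, \rho_{\max})$ is

$$R^{conj}(\mathscr{W}) = R^{conj}(\mu_{\min}).$$

Analogously to (7.5), observe that $\overline{R}^{conj}(\mathscr{W})$ and $R^{conj}(\mathscr{W})$ are determined by the cleanest
and noisiest channels in $\mathscr{W}$, respectively.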
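As a numerical illustration of the conjugate-rate relation, the sketch below evaluates the closed-form slope of (7.3) and the conjugate rate $1 - h_2(1/(1 + \mu/\mu_R))$ for a BSC, then checks that the slopes at a rate and at its conjugate multiply to one. The function names and the example interval $[\rho_{\min}, \rho_{\max}] = [0.05, 0.15]$ are assumptions made for illustration only.

```python
import math

def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def h2_inv(h, tol=1e-12):
    """Inverse of h2 on [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h2(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mu_of_rate(R):
    """mu_R = 1 / rho_R - 1 with rho_R = h2_inv(1 - R)."""
    return 1.0 / h2_inv(1.0 - R) - 1.0

def neg_slope(R, rho):
    """-E'_sp(R, rho) = log(mu / mu_R) / log(mu_R), valid for 0 < R < C(rho)."""
    mu, mu_R = 1.0 / rho - 1.0, mu_of_rate(R)
    return math.log(mu / mu_R) / math.log(mu_R)

def conjugate_rate(R, rho):
    """Conjugate rate of R for a BSC(rho): mu_{R^conj} = mu / mu_R."""
    mu, mu_R = 1.0 / rho - 1.0, mu_of_rate(R)
    return 1.0 - h2(1.0 / (1.0 + mu / mu_R))

if __name__ == "__main__":
    R, rho_min, rho_max = 0.05, 0.05, 0.15
    print("conjugate rate, cleanest channel:", conjugate_rate(R, rho_min))
    print("conjugate rate, noisiest channel:", conjugate_rate(R, rho_max))
```

The conjugate rate grows with $\mu$, so over the compound class the largest value is obtained at $\rho_{\min}$ and the smallest at $\rho_{\max}$.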

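We can now evaluate the conditions of Prop. 5.5 under which $F(t) = \Delta + \lambda|t|^+$ is universal
(subject to conditions on $\Delta$ and $\lambda$). In (5.22), $\Delta$ must satisfy

$$\begin{aligned}
& |\overline{R}^{conj}(\mathscr{W}) - R|^+ \le \Delta \le C(\mathscr{W}) - R \\
\Leftrightarrow\ & |R^{conj}(\mu_{\max}) - R|^+ \le \Delta \le C(\mu_{\min}) - R \\
\Leftrightarrow\ & \left| h_2\left(\frac{1}{1 + \mu_R}\right) - h_2\left(\frac{1}{1 + \mu_{\max}/\mu_R}\right) \right|^+ \le \Delta \le h_2\left(\frac{1}{1 + \mu_R}\right) - h_2\left(\frac{1}{1 + \mu_{\min}}\right).
\end{aligned} \tag{7.6}$$

The left side is zero if $\mu_{\max} \le \mu_R^2$. If $\mu_{\max} > \mu_R^2$, the argument of $|\cdot|^+$ is positive, and we need
$\mu_{\max}/\mu_R \le \mu_{\min}$ to ensure that the left side is lower than the right side. Hence there exists a
nonempty range of values of $\Delta$ satisfying (7.6) if and only if

$$\mu_R \ge \min\left\{ \sqrt{\mu_{\max}},\ \frac{\mu_{\max}}{\mu_{\min}} \right\}$$

which may also be written as

$$\mu_{\max} \le \max\{\mu_R^2,\ \mu_R \mu_{\min}\}.$$

Next, substituting (7.4) into (5.23), we obtain the following condition for $\lambda$:

$$\frac{\log(\mu_{\max}/\mu_R)}{\log \mu_R} \le \lambda \le \frac{\log \mu_{R+\Delta}}{\log(\mu_{\max}/\mu_{R+\Delta})}. \tag{7.7}$$

This equation has a solution if and only if the left side does not exceed the right side, i.e.,

$$\mu_{\max} \le \mu_R \mu_{R+\Delta}, \tag{7.8}$$

or equivalently, $\rho_{\min} \ge (1 + \mu_R \mu_{R+\Delta})^{-1}$. Since $\mu_R$ is an increasing function of $R$, the larger the
values of $R$ and $\Delta$, the lower the value of $\rho_{\min}$ for which the universality property still holds.

If equality holds in (7.8), the only feasible value of $\lambda$ is

$$\lambda = \frac{\log \mu_{R+\Delta}}{\log \mu_R} \ge 1. \tag{7.9}$$

This value of $\lambda$ remains feasible if (7.8) holds with strict inequality.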
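The universality conditions (7.6) through (7.9) are easy to evaluate for given parameters. The following sketch assumes base-2 entropies and natural logarithms in the ratios (which are base-independent); the function names and the example values of $R$, $\Delta$, $\rho_{\min}$, and $\rho_{\max}$ are illustrative assumptions.

```python
import math

def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def h2_inv(h, tol=1e-12):
    """Inverse of h2 on [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h2(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mu_of_rate(R):
    """mu_R = 1 / rho_R - 1 with rho_R = h2_inv(1 - R)."""
    return 1.0 / h2_inv(1.0 - R) - 1.0

def delta_range(R, rho_min, rho_max):
    """Left and right sides of (7.6) for the threshold Delta."""
    mu_R = mu_of_rate(R)
    mu_max, mu_min = 1.0 / rho_min - 1.0, 1.0 / rho_max - 1.0
    base = h2(1.0 / (1.0 + mu_R))  # equals 1 - R
    left = max(0.0, base - h2(1.0 / (1.0 + mu_max / mu_R)))
    right = base - h2(1.0 / (1.0 + mu_min))
    return left, right

def lambda_range(R, delta, rho_min):
    """Left and right sides of (7.7) for the slope lambda."""
    mu_max = 1.0 / rho_min - 1.0
    mu_R, mu_RD = mu_of_rate(R), mu_of_rate(R + delta)
    lo = math.log(mu_max / mu_R) / math.log(mu_R)
    hi = math.log(mu_RD) / math.log(mu_max / mu_RD)
    return lo, hi

def condition_78(R, delta, rho_min):
    """Condition (7.8): mu_max <= mu_R * mu_{R+Delta}."""
    return 1.0 / rho_min - 1.0 <= mu_of_rate(R) * mu_of_rate(R + delta)

if __name__ == "__main__":
    R, delta = 0.3, 0.1
    print("Delta range :", delta_range(R, 0.05, 0.06))
    print("lambda range:", lambda_range(R, delta, 0.05))
    print("(7.8) holds :", condition_78(R, delta, 0.05))
```

When (7.8) fails, as for a very clean $\rho_{\min}$, `lambda_range` returns a lower end that exceeds the upper end, so no slope $\lambda$ is feasible.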
Figure 2: Erasure and incorrect-message exponents $E_\emptyset(R)$ and $E_i(R)$ for BSC with crossover probability
$\rho = 0.1$ (capacity $C(\rho) \approx 0.53$) and rate $R = 0.1$.

8 Discussion 

The $F$-MMI decision rule of (3.1) is a generalization of Csiszár and Körner's MMI decoder with
erasure option. The weighting function $F$ in (3.1) can be optimized in an asymptotic Neyman-Pearson
sense given a compound class of channels $\mathscr{W}$. An explicit formula has been derived in
terms of the sphere-packing exponent function for $F$ that maximizes the erasure exponent subject
to a constraint on the incorrect-message exponent. The optimal $F$ is generally nonunique but agrees
with existing designs in special cases of interest.

In particular, Corollary 5.2 shows that the simple thresholding rule $F(t) = \Delta$ is optimal if
$R \ge I(p_X, \mathscr{W})$, i.e., when the transmission rate cannot be reliably supported by the worst channel
in $\mathscr{W}$. When $R < I(p_X, \mathscr{W})$, Prop. 5.5 shows that for small erasure exponents, our expressions
for the optimal exponents coincide with those derived by Forney [3] for symmetric channels, where
the same input distribution $p_X$ is optimal at all rates. In this regime, Csiszár and Körner's rule
$F(t) = \Delta + \lambda|t|^+$ is also universal under some conditions on the parameter pair $(\Delta, \lambda)$. It is also
worth noting that while suboptimal, the design $F(t) = \Delta + t$ yields an empirical version of Forney's
simple decision rule (2.5).

Previous work [5] using a different universal decoder had shown that Forney's exponents can be
matched in the special case where undetected-error and erasure exponents are equal (corresponding
to $T = 0$ in Forney's rule (2.4)). Our results show that this property extends beyond this special
case, albeit not everywhere.

Another analogy between Forney's suboptimal decision rule (2.5) and ours (3.1) is that the
former is based on the two highest likelihood scores, and the latter is based on the two highest
empirical mutual information scores. Our results imply that (2.5) is optimal (in terms of error
exponents) in the special regime identified above.

The relative minimax criterion of Sec. 6 is attractive when the compound class $\mathscr{W}$ is broad (or
difficult to pick) as it allows finer tuning of the error exponents for different channels in $\mathscr{W}$. The
class $\mathscr{W}$ could conceivably be as large as $\mathscr{P}_{Y|X}$, the set of all DMC's.

Finally, we have extended our framework to decoding for compound multiple access channels. 
Those results will be presented elsewhere. 

Acknowledgements. The author thanks Shankar Sadasivam for numerical evaluation of the 
error exponent formulas in Sec. and Prof. Lizhong Zheng for helpful comments. 
A Proof of Proposition 4.1

(i) is immediate from (4.3), restated below:

$$E_{r,F}(R, p_X, p_{Y|X}) = \min_{\tilde{p}_{Y|X}} \left[D(\tilde{p}_{Y|X} \| p_{Y|X} | p_X) + F(I(p_X, \tilde{p}_{Y|X}) - R)\right], \tag{A.1}$$

and the fact that $F$ is nondecreasing.

(ii) is immediate for the same reason as above.

(iii) Since the function $E_{sp}(R, p_X, p_{Y|X})$ is decreasing in $R$ for $R \le I(p_X, p_{Y|X})$, we have

$$E_{sp}(R, p_X, p_{Y|X}) = \min_{\tilde{p}_{Y|X} :\ I(p_X, \tilde{p}_{Y|X}) = R} D(\tilde{p}_{Y|X} \| p_{Y|X} | p_X), \qquad \forall R \le I(p_X, p_{Y|X}). \tag{A.2}$$

Since $F$ is nondecreasing, $\tilde{p}_{Y|X}$ that achieves the minimum in (A.1) must satisfy $I(p_X, \tilde{p}_{Y|X}) \le I(p_X, p_{Y|X})$. Hence

$$\begin{aligned}
E_{r,F}(R, p_X, p_{Y|X}) &= \min_{\tilde{p}_{Y|X} :\ I(p_X, \tilde{p}_{Y|X}) \le I(p_X, p_{Y|X})} \left[D(\tilde{p}_{Y|X} \| p_{Y|X} | p_X) + F(I(p_X, \tilde{p}_{Y|X}) - R)\right] \\
&= \min_{R' \le I(p_X, p_{Y|X})}\ \min_{\tilde{p}_{Y|X} :\ I(p_X, \tilde{p}_{Y|X}) = R'} \left[D(\tilde{p}_{Y|X} \| p_{Y|X} | p_X) + F(R' - R)\right] \\
&\overset{(a)}{=} \min_{R' \le I(p_X, p_{Y|X})} \left[E_{sp}(R', p_X, p_{Y|X}) + F(R' - R)\right] \\
&\overset{(b)}{=} \min_{R'} \left[E_{sp}(R', p_X, p_{Y|X}) + F(R' - R)\right]
\end{aligned}$$

where (a) is due to (A.2), and (b) holds because $E_{sp}(R', p_X, p_{Y|X}) = 0$ for $R' \ge I(p_X, p_{Y|X})$ and $F$
is nondecreasing.

(iv) The claim follows directly from the definitions (4.2) and (4.4), taking minima over $\mathscr{W}$. □

B Proof of Proposition 4.2

Given the p.m.f. $p_X$, choose any type $p_{\mathbf{x}}$ such that $\max_{x \in \mathcal{X}} |p_{\mathbf{x}}(x) - p_X(x)| \le \frac{|\mathcal{X}|}{N}$.$^2$ Define

$$E_{r,F,N}(R, p_{\mathbf{x}}, p_{Y|X}) = \min_{p_{\mathbf{y}|\mathbf{x}}} \left[D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}}) + F(I(\mathbf{x}; \mathbf{y}) - R)\right] \tag{B.1}$$

and

$$E_{sp,N}(R, p_{\mathbf{x}}, p_{Y|X}) = \min_{p_{\mathbf{y}|\mathbf{x}} :\ I(\mathbf{x}; \mathbf{y}) \le R} D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}}) \tag{B.2}$$

which differ from (4.3) and (4.1) in that the minimization is performed over conditional types
instead of general conditional p.m.f.'s. We have

$$\begin{aligned}
\lim_{N \to \infty} E_{r,F,N}(R, p_{\mathbf{x}}, p_{Y|X}) &= E_{r,F}(R, p_X, p_{Y|X}) \\
\lim_{N \to \infty} E_{sp,N}(R, p_{\mathbf{x}}, p_{Y|X}) &= E_{sp}(R, p_X, p_{Y|X})
\end{aligned} \tag{B.3}$$

by continuity of the divergence functional and of $F$ in the region where the minimand $D + F$ is
finite.

2 For instance, truncate each $p_X(x)$ down to the nearest integer multiple of $1/N$ and add $a/N$ to the smallest
resulting value to obtain $p_{\mathbf{x}}(x)$, $x \in \mathcal{X}$, summing to one. Here $a$ is an integer in the range $\{0, 1, \cdots, |\mathcal{X}| - 1\}$.

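We will use the following two standard inequalities.

1) Given an arbitrary sequence $\mathbf{y}$, draw $\mathbf{x}'$ independently of $\mathbf{y}$ and uniformly over a fixed type
class $T_{\mathbf{x}}$. Then [1]

$$\Pr[T_{\mathbf{x}'|\mathbf{y}}] = \frac{|T_{\mathbf{x}'|\mathbf{y}}|}{|T_{\mathbf{x}}|} = \frac{|T_{\mathbf{x}'|\mathbf{y}}|}{|T_{\mathbf{x}'}|} \doteq 2^{-N I(\mathbf{x}'; \mathbf{y})}.$$

Hence for any $0 \le \nu \le H(p_{\mathbf{x}})$,

$$\begin{aligned}
\Pr[I(\mathbf{x}'; \mathbf{y}) \ge \nu] &= \sum_{T_{\mathbf{x}'|\mathbf{y}}} \Pr[T_{\mathbf{x}'|\mathbf{y}}]\, 1\{I(\mathbf{x}'; \mathbf{y}) \ge \nu\} \\
&\doteq \sum_{T_{\mathbf{x}'|\mathbf{y}}} 2^{-N I(\mathbf{x}'; \mathbf{y})}\, 1\{I(\mathbf{x}'; \mathbf{y}) \ge \nu\} \\
&\overset{(a)}{\doteq} \max_{T_{\mathbf{x}'|\mathbf{y}}} 2^{-N I(\mathbf{x}'; \mathbf{y})}\, 1\{I(\mathbf{x}'; \mathbf{y}) \ge \nu\} \\
&\doteq 2^{-N\nu}
\end{aligned} \tag{B.4}$$

where (a) holds because the number of types is polynomial in $N$. For $\nu > H(p_{\mathbf{x}})$ we have
$\Pr[I(\mathbf{x}'; \mathbf{y}) \ge \nu] = 0$.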
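For binary alphabets, the tail probability in (B.4) can be computed exactly, which gives a feel for the polynomial factors hidden by $\doteq$. In the sketch below, $\mathbf{y}$ has `n1` ones and $\mathbf{x}'$ is uniform over the type class with `m` ones, so the overlap count is hypergeometric. The function names and the polynomial slack used in the checks (based on the standard type-class cardinality bounds) are assumptions of this illustration, not part of the paper.

```python
import math

def empirical_mi(n11, n10, n01, n00):
    """Empirical mutual information (bits) of a 2x2 joint type given as counts.
    nab = number of positions with x = a and y = b."""
    N = n11 + n10 + n01 + n00
    rows = (n11 + n10, n01 + n00)  # counts of x = 1, x = 0
    cols = (n11 + n01, n10 + n00)  # counts of y = 1, y = 0
    cells = ((n11, 0, 0), (n10, 0, 1), (n01, 1, 0), (n00, 1, 1))
    mi = 0.0
    for n, r, c in cells:
        if n > 0:
            mi += (n / N) * math.log2(n * N / (rows[r] * cols[c]))
    return mi

def tail_probability(N, m, n1, nu):
    """Exact Pr[I(x'; y) >= nu] for y with n1 ones and x' uniform over
    the type class with m ones. Also returns the smallest achievable
    empirical mutual information that is >= nu."""
    total = math.comb(N, m)
    prob, nu_ach = 0.0, float("inf")
    for k in range(max(0, m + n1 - N), min(m, n1) + 1):
        mi = empirical_mi(k, m - k, n1 - k, N - m - n1 + k)
        if mi >= nu:
            # Hypergeometric probability of overlap k.
            prob += math.comb(n1, k) * math.comb(N - n1, m - k) / total
            nu_ach = min(nu_ach, mi)
    return prob, nu_ach

if __name__ == "__main__":
    N, nu = 200, 0.1
    p, nu_ach = tail_probability(N, N // 2, N // 2, nu)
    print("exact exponent -log2(Pr)/N:", -math.log2(p) / N)
    print("smallest achievable I >= nu:", nu_ach)
```

For finite $N$ the exact exponent sits slightly above $\nu$ because only a discrete set of joint types is available; the gap shrinks as $N$ grows.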

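2) Given an arbitrary sequence $\mathbf{x}$, draw $\mathbf{y}$ from the conditional p.m.f. $p_{Y|X}(\cdot|\mathbf{x})$. We have [1]

$$\Pr[T_{\mathbf{y}|\mathbf{x}}] \doteq 2^{-N D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}})}. \tag{B.5}$$

Then for any $\nu > 0$,

$$\begin{aligned}
\Pr[I(\mathbf{x}; \mathbf{y}) \le \nu] &= \sum_{T_{\mathbf{y}|\mathbf{x}}} \Pr[T_{\mathbf{y}|\mathbf{x}}]\, 1\{I(\mathbf{x}; \mathbf{y}) \le \nu\} \\
&\doteq \sum_{p_{\mathbf{y}|\mathbf{x}}} 2^{-N D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}})}\, 1\{I(\mathbf{x}; \mathbf{y}) \le \nu\} \\
&\doteq \max_{p_{\mathbf{y}|\mathbf{x}}} 2^{-N D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}})}\, 1\{I(\mathbf{x}; \mathbf{y}) \le \nu\} \\
&= \max_{p_{\mathbf{y}|\mathbf{x}} :\ I(\mathbf{x}; \mathbf{y}) \le \nu} 2^{-N D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}})} \\
&= 2^{-N E_{sp,N}(\nu, p_{\mathbf{x}}, p_{Y|X})}.
\end{aligned} \tag{B.6}$$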

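Incorrect Messages. The codewords are drawn independently and uniformly from type class
$T_{\mathbf{x}}$. Since the conditional error probability is independent of the transmitted message, assume
without loss of generality that message $m = 1$ was transmitted. An incorrect codeword $\mathbf{x}(i)$
appears on the decoder's list if $i > 1$ and

$$I(\mathbf{x}(i); \mathbf{y}) \ge R + \max_{j \ne i} F(I(\mathbf{x}(j); \mathbf{y}) - R).$$

Let $\mathbf{x} = \mathbf{x}(1)$. To evaluate the expected number of incorrect codewords on the list, we first fix $\mathbf{y}$.
Given $\mathbf{y}$, define the i.i.d. random variables $Z_i = I(\mathbf{x}(i); \mathbf{y}) - R$ for $2 \le i \le 2^{NR}$. Also let
$z_1 = I(\mathbf{x}; \mathbf{y}) - R$, which is a function of the joint type $p_{\mathbf{x}\mathbf{y}}$. The expected number of incorrect
codewords on the list depends on $(\mathbf{x}, \mathbf{y})$ only via their joint type and is given by

$$\begin{aligned}
E[N_i | T_{\mathbf{x}\mathbf{y}}] &= \sum_{i=2}^{2^{NR}} \Pr\left[ I(\mathbf{x}(i); \mathbf{y}) \ge R + \max_{j \in \mathcal{M} \setminus \{i\}} F(I(\mathbf{x}(j); \mathbf{y}) - R) \right] \\
&= \sum_{i=2}^{2^{NR}} \Pr\left[ Z_i \ge \max_{j \in \mathcal{M} \setminus \{i\}} F(Z_j) \right] \\
&\le \sum_{i=2}^{2^{NR}} \Pr[Z_i \ge F(z_1)] \\
&= (2^{NR} - 1) \Pr[Z_2 \ge F(z_1)] \\
&\overset{(a)}{=} (2^{NR} - 1) \Pr[I(\mathbf{x}'; \mathbf{y}) \ge R + F(I(\mathbf{x}; \mathbf{y}) - R)] \\
&\overset{(b)}{\doteq} 2^{NR}\, 2^{-N[R + F(I(\mathbf{x}; \mathbf{y}) - R)]}
\end{aligned}$$

where in (a), $\mathbf{x}'$ is drawn independently of $\mathbf{y}$ and uniformly over the type class $T_{\mathbf{x}}$; and (b) is
obtained by application of (B.4).

Averaging over $\mathbf{y}$, we obtain

$$\begin{aligned}
E[N_i | T_{\mathbf{x}}] &= \sum_{T_{\mathbf{y}|\mathbf{x}}} \Pr[T_{\mathbf{y}|\mathbf{x}}]\, E[N_i | T_{\mathbf{x}\mathbf{y}}] \\
&\doteq \max_{T_{\mathbf{y}|\mathbf{x}}} \Pr[T_{\mathbf{y}|\mathbf{x}}]\, E[N_i | T_{\mathbf{x}\mathbf{y}}] \\
&\overset{(a)}{\doteq} \max_{p_{\mathbf{y}|\mathbf{x}}} \exp_2\{-N[D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}}) + F(I(\mathbf{x}; \mathbf{y}) - R)]\} \\
&\overset{(b)}{=} \exp_2\{-N E_{r,F,N}(R, p_{\mathbf{x}}, p_{Y|X})\} \\
&\overset{(c)}{\doteq} \exp_2\{-N E_{r,F}(R, p_X, p_{Y|X})\}
\end{aligned}$$

where (a), (b), (c) follow from (B.5), (B.1), (B.3), respectively. This proves the incorrect-message exponent claimed in Prop. 4.2.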
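Erasure. The decoder fails to return the transmitted codeword $\mathbf{x} = \mathbf{x}(1)$ if

$$I(\mathbf{x}; \mathbf{y}) \le R + \max_{2 \le i \le 2^{NR}} F(I(\mathbf{x}(i); \mathbf{y}) - R). \tag{B.8}$$

Denote by $p_\emptyset(p_{\mathbf{x}})$ the probability of this event. The event is the disjoint union of events $\mathcal{E}_1$ and $\mathcal{E}_2$
below. The first one is

$$\mathcal{E}_1 :\ I(\mathbf{x}; \mathbf{y}) \le R + F(0). \tag{B.9}$$

Since $F^{-1}$ is increasing, $\mathcal{E}_1$ is equivalent to $F^{-1}(I(\mathbf{x}; \mathbf{y}) - R) \le 0$. The second event is

$$\begin{aligned}
\mathcal{E}_2 :\ R + F(0) < I(\mathbf{x}; \mathbf{y}) &\le R + \max_{2 \le i \le 2^{NR}} F(I(\mathbf{x}(i); \mathbf{y}) - R) \\
&= R + F\left( \max_{2 \le i \le 2^{NR}} I(\mathbf{x}(i); \mathbf{y}) - R \right)
\end{aligned}$$

where equality holds because $F$ is nondecreasing. Thus $\mathcal{E}_2$ is equivalent to

$$\max_{2 \le i \le 2^{NR}} I(\mathbf{x}(i); \mathbf{y}) \ge R + F^{-1}(I(\mathbf{x}; \mathbf{y}) - R) > R. \tag{B.10}$$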



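Applying (B.6), we have

$$\begin{aligned}
\Pr[\mathcal{E}_1] = p^{(1)}_\emptyset(p_{\mathbf{x}}) &= \Pr[I(\mathbf{x}; \mathbf{y}) \le R + F(0)] \\
&\doteq \exp_2\{-N E_{sp,N}(R + F(0), p_{\mathbf{x}}, p_{Y|X})\}.
\end{aligned} \tag{B.11}$$

Clearly $p^{(1)}_\emptyset(p_{\mathbf{x}}) \doteq p_\emptyset(p_{\mathbf{x}}) \doteq 1$ if $R + F(0) \ge I(p_X, p_{Y|X})$.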
Next we have 



Pr\£o 



(T. 



Pr 
Pr 



max /(x(») ; y ) > R + P" 1 (/(x; y) - P) 

2<i<2 JVB 



max Zj> F (zi) 

2<i<2 NR 



(a) 



2 -iVF- 1 (zi) 
2 -7VF- 1 (/(x;y)-R) 



where (a) follows by application of the union bound and (|B.4|) . 
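Averaging over $\mathbf{y}$, we obtain

$$\begin{aligned}
p^{(2)}_\emptyset(p_{\mathbf{x}}) &= \sum_{T_{\mathbf{y}|\mathbf{x}} :\ I(\mathbf{x}; \mathbf{y}) > R + F(0)} \Pr[T_{\mathbf{y}|\mathbf{x}}]\, p^{(2)}_\emptyset(T_{\mathbf{y}|\mathbf{x}}) \\
&\doteq \max_{T_{\mathbf{y}|\mathbf{x}} :\ I(\mathbf{x}; \mathbf{y}) > R + F(0)} \Pr[T_{\mathbf{y}|\mathbf{x}}]\, p^{(2)}_\emptyset(T_{\mathbf{y}|\mathbf{x}}) \\
&\doteq \max_{p_{\mathbf{y}|\mathbf{x}} :\ I(\mathbf{x}; \mathbf{y}) > R + F(0)} \exp_2\{-N[D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}}) + F^{-1}(I(\mathbf{x}; \mathbf{y}) - R)]\}.
\end{aligned} \tag{B.12}$$

For $R + F(0) < I(p_X, p_{Y|X})$, we have

$$\max_{p_{\mathbf{y}|\mathbf{x}} :\ I(\mathbf{x}; \mathbf{y}) \le R + F(0)} \exp_2\{-N D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}})\} = \max_{p_{\mathbf{y}|\mathbf{x}} :\ I(\mathbf{x}; \mathbf{y}) = R + F(0)} \exp_2\{-N D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}})\},$$

hence (B.12) may be written more simply as

$$\begin{aligned}
p^{(2)}_\emptyset(p_{\mathbf{x}}) &\doteq \max_{p_{\mathbf{y}|\mathbf{x}}} \exp_2\{-N[D(p_{\mathbf{y}|\mathbf{x}} \| p_{Y|X} | p_{\mathbf{x}}) + |F^{-1}(I(\mathbf{x}; \mathbf{y}) - R)|^+]\} \\
&= \exp_2\{-N E_{r,|F^{-1}|^+,N}(R, p_{\mathbf{x}}, p_{Y|X})\}.
\end{aligned}$$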
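Since $\mathcal{E}_1$ and $\mathcal{E}_2$ are disjoint events, we obtain

$$\begin{aligned}
p_\emptyset(p_{\mathbf{x}}) &= p^{(1)}_\emptyset(p_{\mathbf{x}}) + p^{(2)}_\emptyset(p_{\mathbf{x}}) \\
&\doteq \exp_2\{-N \min\{E_{r,|F^{-1}|^+,N}(R, p_{\mathbf{x}}, p_{Y|X}),\ E_{sp,N}(R + F(0), p_{\mathbf{x}}, p_{Y|X})\}\} \\
&\doteq \exp_2\{-N \min\{E_{r,|F^{-1}|^+}(R, p_X, p_{Y|X}),\ E_{sp}(R + F(0), p_X, p_{Y|X})\}\}
\end{aligned} \tag{B.13}$$

where the last line is due to (B.3).

The function $F^{-1}$ has a zero-crossing at $t = F(0)$. Applying (4.7), we obtain

$$\begin{aligned}
E_{r,|F^{-1}|^+}(R, p_X, p_{Y|X}) &= \min_{R'} \left[E_{sp}(R', p_X, p_{Y|X}) + |F^{-1}(R' - R)|^+\right] \\
&\le E_{sp}(R + F(0), p_X, p_{Y|X}) + 0
\end{aligned}$$

hence

$$p_\emptyset(p_{\mathbf{x}}) \doteq \exp_2\{-N E_{r,|F^{-1}|^+}(R, p_X, p_{Y|X})\}.$$

This proves (4.9). □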



C Proof of Proposition 5.1

We first prove (5.12). Recall (4.7) and (4.6), restated here for convenience:

$$E_{r,F}(R, p_X, \mathscr{W}) = \min_{R'} \left[E_{sp}(R', p_X, \mathscr{W}) + F(R' - R)\right],$$

$$E_{sp}(R, p_X, \mathscr{W}) = E_{sp}(R', p_X, \mathscr{W}) + F_{R,p_X,\mathscr{W}}(R' - R), \qquad \forall R'.$$

Hence

$$E_{r,F_{R,p_X,\mathscr{W}}}(R, p_X, \mathscr{W}) = E_{sp}(R, p_X, \mathscr{W}). \tag{C.1}$$

Case I: $E_{sp}(R, p_X, \mathscr{W}) = \alpha$, i.e., from (5.9), we have $\Delta = 0$. The feasible set $\mathscr{F}^L(R, p_X, \mathscr{W}, \alpha)$
defined in Sec. 5 takes the form $\{F : E_{r,F}(R, p_X, \mathscr{W}) \ge E_{sp}(R, p_X, \mathscr{W})\}$. Owing to (C.1) and
the monotonicity property of Prop. 4.1(ii), an equivalent representation of $\mathscr{F}^L(R, p_X, \mathscr{W}, \alpha)$ is
$\{F : F \succeq F_{R,p_X,\mathscr{W}}\}$. As indicated below the statement of Prop. 5.1, this implies $F_{R,p_X,\mathscr{W}}$ achieves
the supremum in (5.5).

Case II: $E_{sp}(R, p_X, \mathscr{W}) \ne \alpha$. The derivation parallels that of Case I. An equivalent representation
of the constrained set $\mathscr{F}^L(R, p_X, \mathscr{W}, \alpha)$ is $\{F : F \succeq F^{L*}\}$, where

$$\begin{aligned}
F^{L*}(t) &= F_{R,p_X,\mathscr{W}}(t) + \alpha - E_{sp}(R, p_X, \mathscr{W}) \\
&= F_{R,p_X,\mathscr{W}}(t) + \Delta \\
&= \alpha - E_{sp}(R + t, p_X, \mathscr{W}).
\end{aligned}$$

Hence $F^{L*}$ achieves the maximum in (5.5).

To prove (5.13), we simply observe that if $F^{L*} \preceq F$ for all $F \in \mathscr{F}^L(R, p_X, \mathscr{W}, \alpha)$, then $F^* \preceq F$
for all $F \in \mathscr{F}(R, p_X, \mathscr{W}, \alpha)$, where $F^*(t) = \max(t, F^{L*}(t))$. □

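D Proof of Proposition 5.3

We have

$$\begin{aligned}
E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) &\overset{(a)}{=} E_{r,|(F^{L*})^{-1}|^+}(R, p_X, \mathscr{W}) \\
&\overset{(b)}{=} \min_{R'} \left[E_{sp}(R', p_X, \mathscr{W}) + |(F^{L*})^{-1}(R' - R)|^+\right] \\
&\overset{(c)}{=} \min_{R' \ge R + \Delta} \left[E_{sp}(R', p_X, \mathscr{W}) + (F^{L*})^{-1}(R' - R)\right]
\end{aligned} \tag{D.1}$$

where equality (a) results from Props. 4.3 and 5.1, (b) follows from (4.7), and (c) from the definition
(3.12) and the fact that the function $E_{sp}(R', p_X, \mathscr{W})$ is nonincreasing in $R'$. From (5.12) and
property (P3) in Sec. 3 we obtain the inverse function

$$(F^{L*})^{-1}(t) = F^{-1}_{R,p_X,\mathscr{W}}(t - \Delta) \tag{D.2}$$

where $F_{R,p_X,\mathscr{W}}(t)$ is given in (4.5). Hence $t_{(F^{L*})^{-1}} = \Delta$ and

$$E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) = \min_{R' \ge R + \Delta} \underbrace{\left[E_{sp}(R', p_X, \mathscr{W}) + F^{-1}_{R,p_X,\mathscr{W}}(R' - R - \Delta)\right]}_{h(R')}. \tag{D.3}$$

By assumption $E_{sp}(R, p_X, \mathscr{W})$ is convex in $R$, and therefore $F_{R,p_X,\mathscr{W}}(t)$ is concave. By application
of Property (P5) in Sec. 3, the function $F^{-1}_{R,p_X,\mathscr{W}}$ is convex, and thus so is $h(R')$ in (D.3). The
derivatives of $F_{R,p_X,\mathscr{W}}$ and $h$ are respectively given by

$$(F^{L*})'(t) = F'_{R,p_X,\mathscr{W}}(t) = -E'_{sp}(R + t, p_X, \mathscr{W}) \tag{D.4}$$

and

$$h'(R') = E'_{sp}(R', p_X, \mathscr{W}) + \frac{1}{-E'_{sp}(R' - \Delta, p_X, \mathscr{W})}.$$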



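By Prop. 5.1 and the definition of $\Delta$, we have $E_i(R, p_X, \mathscr{W}) = \alpha = E_{sp}(R, p_X, \mathscr{W}) + \Delta$.

Next we prove the statements (i) through (iv).

(i) $\max(0, R^{conj}(p_X, \mathscr{W}) - R) \le \Delta \le I(p_X, \mathscr{W}) - R$.
This case is illustrated in Fig. 3. We have

$$R + \Delta \ge \max(R, R^{conj}(p_X, \mathscr{W})) \ge R_{cr}(p_X, \mathscr{W}). \tag{D.5}$$

Hence

$$\begin{aligned}
h'(R + \Delta) &\ge \begin{cases} E'_{sp}(R, p_X, \mathscr{W}) + \dfrac{1}{-E'_{sp}(R, p_X, \mathscr{W})} \ge 0 & : \text{if } R \ge R_{cr}(p_X, \mathscr{W}) \\[2ex] E'_{sp}(R^{conj}(p_X, \mathscr{W}), p_X, \mathscr{W}) + \dfrac{1}{-E'_{sp}(R, p_X, \mathscr{W})} = 0 & : \text{if } R \le R_{cr}(p_X, \mathscr{W}) \end{cases} \\
&\ge 0.
\end{aligned} \tag{D.6}$$

By convexity of $h(\cdot)$ this implies that $R + \Delta$ minimizes $h(R')$ over $R' \ge R + \Delta$, and so

$$E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) = h(R + \Delta) = E_{sp}(R + \Delta, p_X, \mathscr{W}).$$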



Figure 3: Construction and minimization of $h(R')$ for Case (i). Horizontal axis: rate $R'$.

(ii). Due to (D.5), we have either $R \ge R_{cr}(p_X, \mathscr{W})$ or $R + \Delta \ge R^{conj}(p_X, \mathscr{W}) \ge R$. In both cases,
(D.6) implies

$$-E'_{sp}(R, p_X, \mathscr{W}) \le \frac{1}{-E'_{sp}(R + \Delta, p_X, \mathscr{W})}.$$

Let $F(t) = \Delta + \lambda|t|^+$ where $\lambda$ is sandwiched by the left and right sides of the above inequality. We
have $t_F = 0$ and $F'(t) = \lambda 1\{t \ge 0\}$. The inverse function is $F^{-1}(t) = \frac{1}{\lambda}(t - \Delta)$ for $t \ge \Delta$. Hence
$(F^{-1})'(t) = \frac{1}{\lambda} 1\{t \ge \Delta\}$ and $t_{F^{-1}} = \Delta$. Substituting $F$ and $F^{-1}$ into (4.7), we obtain

$$E_{r,F}(R, p_X, \mathscr{W}) = \min_{R' \ge R} \left[E_{sp}(R', p_X, \mathscr{W}) + \Delta + \lambda(R' - R)\right],$$

$$E_{r,|F^{-1}|^+}(R, p_X, \mathscr{W}) = \min_{R' \ge R + \Delta} \left[E_{sp}(R', p_X, \mathscr{W}) + \frac{1}{\lambda}(R' - R - \Delta)\right].$$

Taking derivatives of the bracketed terms with respect to $R'$ and recalling that

$$\lambda \ge -E'_{sp}(R, p_X, \mathscr{W}), \qquad \frac{1}{\lambda} \ge -E'_{sp}(R + \Delta, p_X, \mathscr{W}),$$

we observe that these derivatives are nonnegative. Since $E_{sp}(\cdot, p_X, \mathscr{W})$ is convex, the minima are
achieved at $R$ and $R + \Delta$ respectively.

The resulting exponents are $E_{sp}(R, p_X, \mathscr{W}) + \Delta$ and $E_{sp}(R + \Delta, p_X, \mathscr{W})$, which coincide with the
optimal exponents of (5.14).
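The claim that both minima sit at the left end of their ranges can be checked on a grid for a single BSC, where $\mathscr{W}$ reduces to one channel. The sketch below assumes base-2 logarithms, a uniform grid over $R'$ up to capacity, and illustrative parameter values; the function names are hypothetical and not taken from the paper.

```python
import math

def h2(p):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def h2_inv(h, tol=1e-12):
    """Inverse of h2 on [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h2(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def e_sp(R, rho):
    """BSC sphere-packing exponent in bits; zero at or above capacity."""
    if R >= 1.0 - h2(rho):
        return 0.0
    a = h2_inv(1.0 - R)  # rho_R
    return a * math.log2(a / rho) + (1.0 - a) * math.log2((1.0 - a) / (1.0 - rho))

def neg_slope(R, rho):
    """-E'_sp(R, rho) = log(mu / mu_R) / log(mu_R), for 0 < R < C(rho)."""
    mu = 1.0 / rho - 1.0
    mu_R = 1.0 / h2_inv(1.0 - R) - 1.0
    return math.log(mu / mu_R) / math.log(mu_R)

def csiszar_korner_exponents(R, delta, lam, rho, steps=500):
    """Grid minimization of the two bracketed terms for F(t) = delta + lam * |t|^+.
    Beyond capacity E_sp vanishes and both terms increase, so the grid stops at C."""
    C = 1.0 - h2(rho)
    grid_i = [R + (C - R) * k / steps for k in range(steps + 1)]
    grid_e = [R + delta + (C - R - delta) * k / steps for k in range(steps + 1)]
    e_incorrect = min(e_sp(r, rho) + delta + lam * (r - R) for r in grid_i)
    e_erasure = min(e_sp(r, rho) + (r - R - delta) / lam for r in grid_e)
    return e_incorrect, e_erasure

if __name__ == "__main__":
    rho, R, delta, lam = 0.1, 0.3, 0.1, 1.0
    print("feasible lambda interval:",
          (neg_slope(R, rho), 1.0 / neg_slope(R + delta, rho)))
    print("exponents:", csiszar_korner_exponents(R, delta, lam, rho))
```

With $\lambda$ inside the feasible interval, the grid minima coincide with $E_{sp}(R, \rho) + \Delta$ and $E_{sp}(R + \Delta, \rho)$.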

(iii). $R \le R_{cr}(p_X, \mathscr{W})$ and $0 \le \Delta \le R^{conj}(p_X, \mathscr{W}) - R$.

This case is illustrated in Fig. 4 in the case $\Delta = 0$. From (D.6), we have $h'(R') = 0$ if and only
if $R'$ and $R' - \Delta$ are conjugate rates. In this case, using the above assumption on $\Delta$, we have

$$R \le R' - \Delta = R_1(\Delta) \le R_{cr}(p_X, \mathscr{W}) \le R' = R_2(\Delta) \le R^{conj}(p_X, \mathscr{W}). \tag{D.7}$$

Hence $R_2(\Delta) = R' \ge R + \Delta$ is feasible for (D.3) and minimizes $h(\cdot)$. Substituting $R'$ back into
(D.3), we obtain

$$E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) = h(R_2(\Delta)) = E_{sp}(R_2(\Delta), p_X, \mathscr{W}) + F^{-1}_{R,p_X,\mathscr{W}}(R_1(\Delta) - R)$$

which establishes (5.16).

Figure 4: Construction and minimization of $h(R')$ for Case (iii), with $\Delta = 0$. Horizontal axis: rate $R'$.

(iv). R < R cr (p X ,W) and i^px, W)-R<A<0. 

Again we have $h'(R') = 0$ if and only if $R'$ and $R' - \Delta$ are conjugate rates. Then, using the
above assumption on $\Delta$, we have

$$R \le R' = R_1(\Delta) \le R_{cr}(p_X, \mathscr{W}) \le R' - \Delta = R_2(\Delta). \tag{D.8}$$

Hence $R_1(\Delta) = R' \ge R + \Delta$ is feasible for (D.3) and minimizes $h(\cdot)$. Substituting $R'$ back into
(D.3), we obtain

$$E^L_\emptyset(R, p_X, \mathscr{W}, \alpha) = h(R_1(\Delta)) = E_{sp}(R_1(\Delta), p_X, \mathscr{W}) + F^{-1}_{R,p_X,\mathscr{W}}(R_2(\Delta) - R)$$

which establishes (5.17). □
E Proof of Lemma 5.4

First we prove (5.20). For any $R_0 < R_1$, we have

$$\begin{aligned}
\int_{R_0}^{R_1} \underline{E}'_{sp}(R, p_X, \mathscr{W})\, dR &\overset{(a)}{=} \int_{R_0}^{R_1} \min_{p_{Y|X} \in \mathscr{W}} E'_{sp}(R, p_X, p_{Y|X})\, dR \\
&\le \min_{p_{Y|X} \in \mathscr{W}} \int_{R_0}^{R_1} E'_{sp}(R, p_X, p_{Y|X})\, dR \\
&= \min_{p_{Y|X} \in \mathscr{W}} \left[E_{sp}(R_1, p_X, p_{Y|X}) - E_{sp}(R_0, p_X, p_{Y|X})\right] \\
&\overset{(b)}{\le} E_{sp}(R_1, p_X, p^\ast_{Y|X}) - E_{sp}(R_0, p_X, p^\ast_{Y|X}) \\
&= \min_{p_{Y|X} \in \mathscr{W}} E_{sp}(R_1, p_X, p_{Y|X}) - E_{sp}(R_0, p_X, p^\ast_{Y|X}) \\
&\le \min_{p_{Y|X} \in \mathscr{W}} E_{sp}(R_1, p_X, p_{Y|X}) - \min_{p_{Y|X} \in \mathscr{W}} E_{sp}(R_0, p_X, p_{Y|X}) \\
&= E_{sp}(R_1, p_X, \mathscr{W}) - E_{sp}(R_0, p_X, \mathscr{W}) \\
&= \int_{R_0}^{R_1} E'_{sp}(R, p_X, \mathscr{W})\, dR
\end{aligned} \tag{E.1}$$

where (a) follows from the definition of $\underline{E}'_{sp}$ in (5.18), and we choose $p^\ast_{Y|X}$ in inequality (b) as the
minimizer of $E_{sp}(R_1, p_X, \cdot)$ over $\mathscr{W}$. Since (E.1) holds for all $R_0 < R_1$, we must have inequality
between the integrands in the left and right sides: $\underline{E}'_{sp}(R, p_X, \mathscr{W}) \le E'_{sp}(R, p_X, \mathscr{W})$. Moreover, the
three inequalities used to derive (E.1) hold with equality if the same $p^\dagger_{Y|X}$ minimizes $E'_{sp}(R, p_X, \cdot)$
at all rates, and the same $p^\ast_{Y|X}$ minimizes $E_{sp}(R, p_X, \cdot)$ at all rates. We need not (and generally do
not) have $p^\dagger_{Y|X} = p^\ast_{Y|X}$.$^3$

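Next we prove (5.21). By definition of $R^{conj}(p_X, \mathscr{W})$, we have

$$\begin{aligned}
-E'_{sp}(R^{conj}(p_X, \mathscr{W}), p_X, \mathscr{W}) &= \frac{1}{-E'_{sp}(R, p_X, \mathscr{W})} \\
&\overset{(a)}{\ge} \frac{1}{-\underline{E}'_{sp}(R, p_X, \mathscr{W})} = \min_{p_{Y|X} \in \mathscr{W}} \frac{1}{-E'_{sp}(R, p_X, p_{Y|X})} \\
&= \min_{p_{Y|X} \in \mathscr{W}} \left[-E'_{sp}(R^{conj}(p_X, p_{Y|X}), p_X, p_{Y|X})\right] \\
&\overset{(b)}{\ge} \min_{p_{Y|X} \in \mathscr{W}} \left[-E'_{sp}(\overline{R}^{conj}(p_X, \mathscr{W}), p_X, p_{Y|X})\right] \\
&\overset{(c)}{=} -\underline{E}'_{sp}(\overline{R}^{conj}(p_X, \mathscr{W}), p_X, \mathscr{W}) \\
&\overset{(d)}{\ge} -E'_{sp}(\overline{R}^{conj}(p_X, \mathscr{W}), p_X, \mathscr{W})
\end{aligned}$$

where (a) and (d) are due to (5.20), (b) to the definition of $\overline{R}^{conj}(p_X, \mathscr{W})$ in (5.19) and the fact that
$-E'_{sp}(R, p_X, p_{Y|X})$ is a decreasing function of $R$, and (c) from (5.18). Since $-E'_{sp}(R, p_X, \mathscr{W})$ is also
a decreasing function of $R$, we must have $R^{conj}(p_X, \mathscr{W}) \le \overline{R}^{conj}(p_X, \mathscr{W})$. Moreover, the conditions
for equality are the same as those for equality in (5.20). □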



3 While $p^\ast_{Y|X}$ is the noisiest channel in $\mathscr{W}$, $p^\dagger_{Y|X}$ may be the cleanest channel in $\mathscr{W}$, as in the BSC example of
Sec. 7.